RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
Pith reviewed 2026-05-11 06:35 UTC · model grok-4.3
The pith
Synthetic data from structured domain randomization lets bimanual robot policies reach a 367 percent relative gain with only 10 real demonstrations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RoboTwin 2.0 supplies an automated pipeline that produces diverse, realistic synthetic data for bimanual manipulation, instantiated over 50 tasks and five robot embodiments, and yields a 10.9 percent improvement in code generation success; downstream VLA policies trained on this synthetic data plus only 10 real demonstrations show a 367 percent relative gain over the 10-demo real baseline, while purely synthetic zero-shot training delivers a 228 percent gain.
What carries the argument
The expert data synthesis pipeline that combines multimodal language models with simulation-in-the-loop refinement to generate task-level execution code, augmented by five-axis domain randomization.
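A minimal sketch of what such a generate-execute-refine cycle could look like, assuming hypothetical helpers (mllm_generate_task_code, run_in_sim, mllm_repair_code) in place of the actual RoboTwin 2.0 interfaces; none of these names come from the paper.

    from dataclasses import dataclass

    @dataclass
    class SimResult:
        success: bool
        error_log: str

    def mllm_generate_task_code(task_description: str) -> str:
        """Placeholder: prompt a multimodal LLM for task-level execution code."""
        return f"# candidate code for: {task_description}"

    def run_in_sim(code: str) -> SimResult:
        """Placeholder: execute the candidate code in the simulator."""
        return SimResult(success=True, error_log="")

    def mllm_repair_code(code: str, error_log: str) -> str:
        """Placeholder: feed simulator feedback back to the MLLM for repair."""
        return code + f"\n# patched after: {error_log}"

    def synthesize_expert_code(task_description: str, max_rounds: int = 3) -> str | None:
        code = mllm_generate_task_code(task_description)
        for _ in range(max_rounds):
            result = run_in_sim(code)
            if result.success:
                return code          # simulation-verified expert code
            code = mllm_repair_code(code, result.error_log)
        return None                  # discard tasks that never pass verification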
If this is right
- Zero-shot models trained solely on the synthetic data achieve 228 percent gains over real-data baselines.
- Task code generation success rises by 10.9 percent through the automated MLLM and simulation refinement loop.
- Unified evaluation protocols become feasible across 50 dual-arm tasks and five different robot embodiments.
- Data diversity from five-axis randomization supports policy robustness to environmental variations in real settings.
Where Pith is reading between the lines
- The same randomization axes and MLLM pipeline could be adapted to single-arm or multi-robot scenarios to lower data collection costs in other manipulation domains.
- If MLLM task generation scales reliably, new manipulation tasks could be synthesized on demand for novel objects or layouts without human redesign.
- Over time the method suggests a path toward largely synthetic pre-training of general robotic models before fine-tuning on small real sets.
- Validation on dynamic scenes with moving obstacles or deformable objects would test whether the current randomization covers the full range of real-world variability.
Load-bearing premise
Domain randomization across clutter, lighting, background, tabletop height, and language, together with MLLM-generated tasks, produces synthetic data that transfers to real bimanual manipulation without a significant sim-to-real gap.
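As one concrete reading of that premise, here is a minimal sketch of sampling a scene configuration along the five named axes; the SceneConfig container, the numeric ranges, and the instruction templates are illustrative assumptions, not values from the paper.

    import random
    from dataclasses import dataclass

    @dataclass
    class SceneConfig:
        n_clutter_objects: int    # clutter axis
        light_intensity: float    # lighting axis
        background_id: int        # background axis
        table_height_m: float     # tabletop-height axis
        instruction: str          # language axis

    INSTRUCTION_TEMPLATES = [
        "pick up the {obj} and hand it to the other arm",
        "move the {obj} onto the tray",
    ]

    def sample_scene(target_obj: str, rng: random.Random) -> SceneConfig:
        # Each generated episode draws one configuration per axis.
        return SceneConfig(
            n_clutter_objects=rng.randint(0, 8),
            light_intensity=rng.uniform(0.3, 1.5),
            background_id=rng.randrange(20),
            table_height_m=rng.uniform(0.70, 0.85),
            instruction=rng.choice(INSTRUCTION_TEMPLATES).format(obj=target_obj),
        )

    print(sample_scene("mug", random.Random(0)))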
What would settle it
Real-robot trials in which VLA policies trained on the RoboTwin 2.0 synthetic data plus 10 demonstrations perform no better than, or worse than, the 10-demo real baseline alone would show that the synthetic data does not close the transfer gap.
Original abstract
Simulation-based data synthesis has emerged as a powerful paradigm for advancing real-world robotic manipulation. Yet existing datasets remain insufficient for robust bimanual manipulation due to (1) the lack of scalable task generation methods and (2) oversimplified simulation environments. We present RoboTwin 2.0, a scalable framework for automated, large-scale generation of diverse and realistic data, together with unified evaluation protocols for dual-arm manipulation. At its core is RoboTwin-OD, an object library of 731 instances across 147 categories with semantic and manipulation-relevant annotations. Building on this, we design an expert data synthesis pipeline that leverages multimodal language models (MLLMs) and simulation-in-the-loop refinement to automatically generate task-level execution code. To improve sim-to-real transfer, RoboTwin 2.0 applies structured domain randomization along five axes: clutter, lighting, background, tabletop height, and language, enhancing data diversity and policy robustness. The framework is instantiated across 50 dual-arm tasks and five robot embodiments. Empirically, it yields a 10.9% gain in code generation success rate. For downstream policy learning, a VLA model trained with synthetic data plus only 10 real demonstrations achieves a 367% relative improvement over the 10-demo baseline, while zero-shot models trained solely on synthetic data obtain a 228% gain. These results highlight the effectiveness of RoboTwin 2.0 in strengthening sim-to-real transfer and robustness to environmental variations. We release the data generator, benchmark, dataset, and code to support scalable research in robust bimanual manipulation. Project Page: https://robotwin-platform.github.io/, Code: https://github.com/robotwin-Platform/robotwin/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents RoboTwin 2.0, a framework for scalable synthetic data generation for bimanual robotic manipulation. It introduces the RoboTwin-OD object library (731 instances, 147 categories), an MLLM-driven expert synthesis pipeline with simulation-in-the-loop refinement for task code generation, structured domain randomization on five axes, and unified benchmarks across 50 dual-arm tasks and five robot embodiments. Empirical claims include a 10.9% gain in code generation success rate, plus downstream policy learning results showing a 367% relative improvement for a VLA model trained on synthetic data plus 10 real demonstrations and a 228% gain for zero-shot synthetic-only training.
Significance. If the central claims hold after addressing the noted gaps, the work provides a useful open-source contribution to robotics research by enabling large-scale, diverse data synthesis for bimanual tasks where real-world data collection is costly. The release of the generator, benchmark, dataset, and code supports reproducibility and could help standardize evaluation in this area.
Major comments (2)
- [Abstract] Domain randomization is stated to be applied along exactly five axes (clutter, lighting, background, tabletop height, and language). Bimanual manipulation involves repeated contacts whose outcomes depend on physics parameters such as friction coefficients, object masses, and joint damping. With no mention of randomizing or identifying these dynamics parameters, the reported zero-shot (228%) and hybrid (367%) policy gains may reflect a specific sim-real dynamics match rather than the claimed robustness, weakening the central claim of strong domain randomization for robust sim-to-real transfer.
- [Abstract] Empirical claims: The abstract reports precise relative improvements (10.9%, 367%, 228%) without specifying the number of trials, baseline implementations, statistical significance tests, variance, or controls for task difficulty and embodiment variation. These details are load-bearing for evaluating whether the gains demonstrate genuine generalization or are sensitive to particular experimental choices.
Minor comments (2)
- [Abstract] The abstract and introduction would benefit from a short explicit statement of the fixed physics parameters and any justification for not randomizing them, to preempt reader concerns about the dynamics gap.
- [Abstract] Ensure that the definition of 'relative improvement' (e.g., whether it is (new - baseline)/baseline) is stated clearly when first introducing the 367% and 228% figures, to avoid any ambiguity in interpretation; see the worked check below.
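As a worked check of that definition: under the (new - baseline)/baseline reading, a 367% gain means the new success rate is 4.67 times the baseline. The success rates here are made-up illustrations, not numbers from the paper.

    def relative_improvement(new: float, baseline: float) -> float:
        # (new - baseline) / baseline, the conventional relative-gain definition
        return (new - baseline) / baseline

    baseline = 0.12              # hypothetical 10-demo baseline success rate
    new = baseline * 4.67        # what a 367% relative gain would imply
    print(f"{relative_improvement(new, baseline):.0%}")   # -> 367%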
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. The comments highlight important aspects of domain randomization and the presentation of results. We address each major comment below and will revise the manuscript to incorporate clarifications and expansions where appropriate.
Point-by-point responses
- Referee: [Abstract] Domain randomization is stated to be applied along exactly five axes (clutter, lighting, background, tabletop height, and language). Bimanual manipulation involves repeated contacts whose outcomes depend on physics parameters such as friction coefficients, object masses, and joint damping. With no mention of randomizing or identifying these dynamics parameters, the reported zero-shot (228%) and hybrid (367%) policy gains may reflect a specific sim-real dynamics match rather than the claimed robustness, weakening the central claim of strong domain randomization for robust sim-to-real transfer.
Authors: We agree that dynamics parameters such as friction coefficients, object masses, and joint damping are important for contact-rich bimanual manipulation and sim-to-real transfer. The current implementation of RoboTwin 2.0 applies structured domain randomization specifically along the five environmental and linguistic axes to target major sources of visual and task variation. The underlying simulator (Isaac Sim) uses fixed but realistic default physics parameters that are held consistent between data generation and policy evaluation. While the reported gains demonstrate the value of the introduced diversity, we acknowledge that the absence of dynamics randomization weakens the 'strong domain randomization' claim. In the revised manuscript, we will expand the randomization to include plausible ranges for friction, mass, and damping, update the abstract and methods section accordingly, and re-evaluate a subset of tasks to quantify the additional benefit. revision: yes
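A minimal sketch of the dynamics randomization this response commits to, kept in the same sampling style as the scene randomization above; the parameter names and ranges are plausible assumptions, not values from the paper or defaults of any particular simulator.

    import random
    from dataclasses import dataclass

    @dataclass
    class DynamicsConfig:
        friction: float           # contact friction coefficient
        object_mass_scale: float  # multiplier on nominal object masses
        joint_damping: float      # arm joint damping

    def sample_dynamics(rng: random.Random) -> DynamicsConfig:
        # The current release fixes these; the revision promises to randomize them.
        return DynamicsConfig(
            friction=rng.uniform(0.4, 1.2),
            object_mass_scale=rng.uniform(0.7, 1.3),
            joint_damping=rng.uniform(0.5, 2.0),
        )

    # One draw per generated episode:
    print(sample_dynamics(random.Random(7)))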
- Referee: [Abstract] Empirical claims: The abstract reports precise relative improvements (10.9%, 367%, 228%) without specifying the number of trials, baseline implementations, statistical significance tests, variance, or controls for task difficulty and embodiment variation. These details are load-bearing for evaluating whether the gains demonstrate genuine generalization or are sensitive to particular experimental choices.
Authors: The full experimental details—including evaluation over 100 trials per task, baseline implementations following standard VLA and behavior cloning protocols, reporting of mean and standard deviation across three random seeds, and explicit controls across all 50 tasks and five embodiments—are provided in the Experiments section. To make the abstract more self-contained and address the concern directly, we will revise it to include a brief qualifier on the evaluation scale and variance (e.g., 'evaluated across 100 trials per task with standard deviations reported') while respecting length constraints. revision: yes
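A minimal sketch of the reporting protocol described in that response: per-task success over 100 trials, aggregated as a mean and standard deviation across three seeds. The counts below are fabricated for illustration.

    import statistics

    # successes out of 100 trials for one task, keyed by random seed (hypothetical)
    successes_per_seed = {0: 46, 1: 51, 2: 43}

    rates = [n / 100 for n in successes_per_seed.values()]
    mean = statistics.mean(rates)
    std = statistics.stdev(rates)
    print(f"success = {mean:.1%} +/- {std:.1%} (3 seeds, 100 trials each)")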
Circularity Check
No circularity: empirical data generation and policy evaluation results stand independently.
Full rationale
The paper describes an automated data synthesis pipeline using MLLMs and domain randomization over five visual/scene axes, then reports measured performance gains (10.9% code success, 367% and 228% relative policy improvements) from training VLA models on the resulting synthetic data. These are downstream empirical outcomes obtained by running the generator, training policies, and evaluating transfer; no equations, fitted parameters, or predictions are defined in terms of the target metrics themselves. No self-citation chains or uniqueness theorems are invoked to force the central claims. The derivation chain is therefore self-contained as an engineering pipeline plus experimental measurement.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Multimodal language models can reliably generate executable robot control code from task descriptions.
- Domain assumption: Structured domain randomization in simulation improves sim-to-real transfer for manipulation policies.
Forward citations
Cited by 60 Pith papers
- CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
  CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...
- RotVLA: Rotational Latent Action for Vision-Language-Action Model
  RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
- From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
  MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
- RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning
  RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.
- CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models
  Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational ov...
- VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models
  VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
- NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
  NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
- Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
  Frequency analysis of smooth robot actions bounds denoising error to low-frequency modes, enabling a sub-1% parameter 3D diffusion policy with two-step inference that reaches SOTA on manipulation benchmarks.
- VUDA: Breaking CUDA-Vulkan Isolation for Spatial Sharing of Compute and Graphics on the Same GPU
  VUDA enables spatial sharing between CUDA and Vulkan on GPUs via channel redirection and page-table grafting, achieving up to 85% higher throughput than temporal baselines in embodied AI tasks.
- Being-H0.7: A Latent World-Action Model from Egocentric Videos
  Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
- 3D Generation for Embodied AI and Robotic Simulation: A Survey
  3D generation for embodied AI is shifting from visual realism toward interaction readiness, organized into data generation, simulation environments, and sim-to-real bridging roles.
- HiPolicy: Hierarchical Multi-Frequency Action Chunking for Policy Learning
  HiPolicy is a new hierarchical multi-frequency action chunking method for imitation learning that jointly generates coarse and fine action sequences with entropy-guided execution to improve performance and efficiency ...
- BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination
  BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.
- JailWAM: Jailbreaking World Action Models in Robot Control
  JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.
- GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
  GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
- From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation
  AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bim...
- See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model
  GridS reduces visual tokens in VLA models to under 10% of the original count via task-aware differentiable resampling, delivering 76% lower FLOPs with no drop in task success rate on benchmarks and real robots.
- PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models
  PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.
- HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions
  HeteroGenManip decouples grasp localization from interaction planning using task-conditioned foundation models and multi-model diffusion policies, delivering 31% average gains in broad simulation tasks and 36.7% in fo...
- HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions
  A task-conditioned two-stage system decouples grasp localization from interaction trajectory planning using specialized foundation models to improve generalization across heterogeneous object types.
- Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation
  VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
- When to Trust Imagination: Adaptive Action Execution for World Action Models
  A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.
- When to Trust Imagination: Adaptive Action Execution for World Action Models
  Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...
- PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World
  PhysForge generates physics-grounded 3D assets via a VLM-planned Hierarchical Physical Blueprint and a KineVoxel Injection diffusion model, backed by the new PhysDB dataset of 150,000 annotated assets.
- ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
  ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
- From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
  A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...
- Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation
  A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.
- Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
  Hydra-DP3 achieves SOTA visuomotor performance with under 1% of prior 3D diffusion policy parameters by using frequency analysis to justify a lightweight decoder and two-step DDIM inference.
- Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
  Hydra-DP3 is a lightweight 3D diffusion policy that uses frequency analysis of smooth action trajectories to enable two-step DDIM inference and achieves state-of-the-art results with under 1% of prior parameters.
- MotuBrain: An Advanced World Action Model for Robot Control
  MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new rob...
- Reconstruction by Generation: 3D Multi-Object Scene Reconstruction from Sparse Observations
  RecGen achieves state-of-the-art 3D multi-object scene reconstruction from sparse RGB-D views by combining compositional synthetic scene generation with strong 3D shape priors, outperforming SAM3D by 30%+ in shape qua...
- Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
  X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
- Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
  X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...
- LeHome: A Simulation Environment for Deformable Object Manipulation in Household Scenarios
  LeHome is a simulation platform offering high-fidelity dynamics for robotic manipulation of varied deformable objects in household settings, with support for multiple robot embodiments including low-cost hardware.
- From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation
  Digital Cousins is a generative real-to-sim method that creates diverse high-fidelity simulation scenes from real panoramas to improve generalization in robot learning and evaluation.
- Grounded World Model for Semantically Generalizable Planning
  A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
- AIM: Intent-Aware Unified World Action Modeling with Spatial Value Maps
  AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.
- DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks
  CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.
- AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement
  AnySlot decouples language grounding from low-level control by inserting an explicit visual goal image, yielding better zero-shot performance on precise slot placement tasks than flat VLA policies.
- V-CAGE: Vision-Closed-Loop Agentic Generation Engine for Robotic Manipulation
  V-CAGE automates the creation of scalable, high-quality robotic manipulation datasets through context-aware scene construction, closed-loop visual verification, and perceptually-driven compression.
- AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly
  AssemLM uses a specialized point cloud encoder inside a multimodal LLM to reach state-of-the-art 6D pose prediction for assembly tasks, backed by a new 900K-sample benchmark called AssemBench.
- SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds
  SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformabl...
- Fast-WAM: Do World Action Models Need Test-time Future Imagination?
  Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
- SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
  SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' p...
- AttenA+: Rectifying Action Inequality in Robotic Foundation Models
  AttenA+ applies velocity-driven action attention to reweight training objectives toward kinematically critical low-velocity segments, yielding small benchmark gains on Libero and RoboTwin without added parameters.
- X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction
  X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.
- Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation
  The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LI...
- Nautilus: From One Prompt to Plug-and-Play Robot Learning
  NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
- ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation
  ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.
- Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models
  Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.
- StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
  StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
- STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation
  STARRY uses unified diffusion to align spatial-temporal world predictions with action generation plus GASAM for geometry-aware attention, reaching 93.82%/93.30% success on 50 bimanual tasks in simulation and raising r...
- StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
  StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...
- R3D: Revisiting 3D Policy Learning
  A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
- HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
  HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
- HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
  HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.
- ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation
  Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.
- CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment
  CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer ...
- Causal World Modeling for Robot Control
  LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.
- Motus: A Unified Latent Action World Model
  Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.
Reference graph
Works this paper leans on
- [1] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. $\pi_0$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
- [2] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
- [3] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
- [4] Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xu Huang, Shu Jiang, et al. AgiBot World Colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025.
- [5] Yevgen Chebotar, Ankur Handa, Viktor Makoviychuk, Miles Macklin, Jan Issac, Nathan Ratliff, and Dieter Fox. Closing the sim-to-real loop: Adapting simulation randomization with real world experience. In 2019 International Conference on Robotics and Automation (ICRA), pages 8973–8979. IEEE, 2019.
- [6] Junting Chen, Yao Mu, Qiaojun Yu, Tianming Wei, Silang Wu, Zhecheng Yuan, Zhixuan Liang, Chao Yang, Kaipeng Zhang, Wenqi Shao, Yu Qiao, Huazhe Xu, Mingyu Ding, and Ping Luo. RoboScript: Code generation for free-form manipulation tasks across real and simulation, 2024.
- [7] Tianxing Chen, Yao Mu, Zhixuan Liang, Zanxin Chen, Shijia Peng, Qiangyu Chen, Mingkun Xu, Ruizhen Hu, Hongyuan Zhang, Xuelong Li, et al. G3Flow: Generative 3D semantic flow for pose-aware and generalizable object manipulation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1735–1744, 2025.
- [8] Tianxing Chen, Kaixuan Wang, Zhaohui Yang, Yuhao Zhang, Zanxin Chen, Baijun Chen, Wanxi Dong, Ziyuan Liu, Dong Chen, Tianshuo Yang, et al. Benchmarking generalizable bimanual manipulation: RoboTwin dual-arm collaboration challenge at CVPR 2025 MEIS workshop. arXiv preprint arXiv:2506.23351, 2025.
- [9] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023.
- [10] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3D objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.
- [11] Shengliang Deng, Mi Yan, Songlin Wei, Haixin Ma, Yuxin Yang, Jiayi Chen, Zhiqi Zhang, Taoyu Yang, Xuheng Zhang, Heming Cui, et al. GraspVLA: A grasping foundation model pre-trained on billion-scale synthetic action data. arXiv preprint arXiv:2505.03233, 2025.
- [12] Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge Data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396, 2021.
- [13] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. CodeBERT: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155, 2020.
- [14] Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024.
- [15] Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, et al. RoboVerse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning. arXiv preprint arXiv:2504.18904, 2025.
- [16] Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, et al. ManiSkill2: A unified benchmark for generalizable manipulation skills. In The Eleventh International Conference on Learning Representations, 2023.
- [17] Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. UniXcoder: Unified cross-modal pre-training for code representation. arXiv preprint arXiv:2203.03850, 2022.
- [18] Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, and Ping Luo. HiAgent: Hierarchical working memory management for solving long-horizon agent tasks with large language model. arXiv preprint arXiv:2408.09559, 2024.
- [19] Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Yao Mu, Hongyuan Zhang, Wenqi Shao, and Ping Luo. Text2World: Benchmarking large language models for symbolic world model generation. arXiv preprint arXiv:2502.13092, 2025.
- [20] Pu Hua, Minghuan Liu, Annabella Macaluso, Yunfeng Lin, Weinan Zhang, Huazhe Xu, and Lirui Wang. GenSim2: Scaling robot data generation with multi-modal and reasoning LLMs. In 8th Annual Conference on Robot Learning.
- [21] Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 3D Diffuser Actor: Policy diffusion with 3D scene representations. arXiv preprint arXiv:2402.10885, 2024.
- [22] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025.
- [23] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. OpenVLA: An open-source vision-language-action model. In 8th Annual Conference on Robot Learning.
- [24] Zhiqian Lan, Yuxuan Jiang, Ruiqi Wang, Xuanbing Xie, Rongkui Zhang, Yicheng Zhu, Peihang Li, Tianshuo Yang, Tianxing Chen, Haoyu Gao, et al. AutoBio: A simulation and benchmark for robotic automation in digital biology laboratory. arXiv preprint arXiv:2505.14030, 2025.
- [25] Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, and Ken Goldberg. DART: Noise injection for robust imitation learning. In Conference on Robot Learning, pages 143–156. PMLR, 2017.
- [26] Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. CogACT: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024.
- [27] Zhixuan Liang, Yao Mu, Mingyu Ding, Fei Ni, Masayoshi Tomizuka, and Ping Luo. AdaptDiffuser: Diffusion models as adaptive self-evolving planners. In International Conference on Machine Learning, pages 20725–20745. PMLR, 2023.
- [28] Zhixuan Liang, Yao Mu, Hengbo Ma, Masayoshi Tomizuka, Mingyu Ding, and Ping Luo. SkillDiffuser: Interpretable hierarchical planning via skill abstractions in diffusion-based task execution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16467–16476, 2024.
- [29] Zhixuan Liang, Yao Mu, Yixiao Wang, Tianxing Chen, Wenqi Shao, Wei Zhan, Masayoshi Tomizuka, Ping Luo, and Mingyu Ding. DexHandDiff: Interaction-aware diffusion planning for adaptive dexterous manipulation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1745–1755, 2025.
- [30] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.
- [31] Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1B: A diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024.
- [32] Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022.
- [33] Yao Mu, Junting Chen, Qing-Long Zhang, Shoufa Chen, Qiaojun Yu, Chongjian Ge, Runjian Chen, Zhixuan Liang, Mengkang Hu, Chaofan Tao, et al. RoboCodeX: Multimodal code generation for robotic behavior synthesis. In International Conference on Machine Learning, pages 36434–36454. PMLR, 2024.
- [34] Yao Mu, Tianxing Chen, Zanxin Chen, Shijia Peng, Zhiqian Lan, Zeyu Gao, Zhixuan Liang, Qiaojun Yu, Yude Zou, Mingkun Xu, et al. RoboTwin: Dual-arm robot benchmark with generative digital twins. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
- [35] Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. RoboCasa: Large-scale simulation of everyday tasks for generalist robots. In Robotics: Science and Systems (RSS), 2024.
- [36] Abby O'Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.
- [37] Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3803–3810. IEEE, 2018.
- [38] Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine. EPOpt: Learning robust neural network policies using model ensembles. In International Conference on Learning Representations, 2017.
- [39] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. CodeBLEU: A method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297, 2020.
- [40] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024.
- [41] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30. IEEE, 2017.
- [42] Chenxi Wang, Hongjie Fang, Hao-Shu Fang, and Cewu Lu. RISE: 3D perception makes real-world robot imitation simple and effective. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2870–2877. IEEE, 2024.
- [43] Yufei Wang, Zhou Xian, Feng Chen, Tsun-Hsuan Wang, Yian Wang, Katerina Fragkiadaki, Zackory Erickson, David Held, and Chuang Gan. RoboGen: Towards unleashing infinite data for automated robot learning via generative simulation, 2023.
- [44] Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. DexVLA: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855, 2025.
- [45] Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, and Jian Tang. TinyVLA: Toward fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 10(4):3988–3995, 2025.
- [46] Wu Wen, Xiaobo Xue, Ya Li, Peng Gu, and Jianfeng Xu. Code similarity detection using AST and textual information. International Journal of Performability Engineering, 15(10):2683, 2019.
- [47] Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. RoboMIND: Benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint arXiv:2412.13877, 2024.
- [48] Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. SAPIEN: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11097–11107, 2020.
- [49] Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Se June Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos. In CoRL 2024 Workshop on Whole-body Control and Bimanual Manipulation: Applications in Humanoids and Beyond.
- [50] Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pages 1094–1100. PMLR, 2020.
- [51] Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3D Diffusion Policy. arXiv preprint arXiv:2403.03954, 2024.
- [52] Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Abhishek Joshi, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020.