RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

Baijun Chen, Tianxing Chen, Yibin Liu, Zanxin Chen, Zijian Cai, Zixuan Li · 2025 · cs.RO · arXiv 2506.18088

Baseline reference. 52% of citing Pith papers use this work as a benchmark or comparison.

172 Pith papers citing it

Baseline 52% of classified citations

abstract

Simulation-based data synthesis has emerged as a powerful paradigm for advancing real-world robotic manipulation. Yet existing datasets remain insufficient for robust bimanual manipulation due to (1) the lack of scalable task generation methods and (2) oversimplified simulation environments. We present RoboTwin 2.0, a scalable framework for automated, large-scale generation of diverse and realistic data, together with unified evaluation protocols for dual-arm manipulation. At its core is RoboTwin-OD, an object library of 731 instances across 147 categories with semantic and manipulation-relevant annotations. Building on this, we design an expert data synthesis pipeline that leverages multimodal language models (MLLMs) and simulation-in-the-loop refinement to automatically generate task-level execution code. To improve sim-to-real transfer, RoboTwin 2.0 applies structured domain randomization along five axes: clutter, lighting, background, tabletop height, and language, enhancing data diversity and policy robustness. The framework is instantiated across 50 dual-arm tasks and five robot embodiments. Empirically, it yields a 10.9% gain in code generation success rate. For downstream policy learning, a VLA model trained with synthetic data plus only 10 real demonstrations achieves a 367% relative improvement over the 10-demo baseline, while zero-shot models trained solely on synthetic data obtain a 228% gain. These results highlight the effectiveness of RoboTwin 2.0 in strengthening sim-to-real transfer and robustness to environmental variations. We release the data generator, benchmark, dataset, and code to support scalable research in robust bimanual manipulation. Project Page: https://robotwin-platform.github.io/, Code: https://github.com/robotwin-Platform/robotwin/.
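As a rough illustration of what randomizing along the five axes named in the abstract (clutter, lighting, background, tabletop height, language) could look like, here is a minimal Python sketch. It is not the RoboTwin 2.0 API: `SceneConfig`, `sample_scene`, the texture names, the instruction templates, and every numeric range are hypothetical placeholders chosen only for the example.

```python
import random
from dataclasses import dataclass

# Hypothetical value pools; the real generator's assets and ranges are not given here.
BACKGROUNDS = ["wood", "marble", "fabric"]
INSTRUCTIONS = [
    "pick up the cup with the left arm",
    "use your left gripper to grab the cup",
    "grasp the cup using the left hand",
]


@dataclass(frozen=True)
class SceneConfig:
    """One randomized scene, with one field per randomization axis."""
    num_clutter_objects: int   # clutter
    light_intensity: float     # lighting (arbitrary relative scale)
    background_texture: str    # background
    table_height_m: float      # tabletop height, in metres
    instruction: str           # language


def sample_scene(rng: random.Random) -> SceneConfig:
    """Draw each axis independently from its (assumed) range."""
    return SceneConfig(
        num_clutter_objects=rng.randint(0, 5),
        light_intensity=rng.uniform(0.5, 1.5),
        background_texture=rng.choice(BACKGROUNDS),
        table_height_m=rng.uniform(0.70, 0.80),
        instruction=rng.choice(INSTRUCTIONS),
    )


# A seeded generator per episode keeps every scene reproducible.
configs = [sample_scene(random.Random(seed)) for seed in range(3)]
for cfg in configs:
    print(cfg)
```

Passing an explicit `random.Random` instance rather than using the module-level functions is a design choice for reproducibility: the same seed always yields the same scene, which makes it possible to replay a failing episode.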

citation-role summary

dataset 22 · background 17 · other 1

citation-polarity summary

use dataset 21 · background 16 · unclear 3

representative citing papers

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

cs.CV · 2026-05-26 · unverdicted · novelty 8.0

SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.

Labimus: A Simulation and Benchmark for Humanoid Dexterous Manipulation in Chemical Laboratory

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

Labimus is the first benchmark for humanoid dexterous manipulation in organic chemistry laboratories, exposing a gap between task completion and required experimental precision.

ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models

cs.RO · 2026-06-25 · unverdicted · novelty 7.0 · 2 refs

ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

cs.RO · 2026-06-22 · unverdicted · novelty 7.0 · 2 refs

LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi

EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

EBench is a benchmark that evaluates generalist mobile manipulation policies on 26 tasks across 5 capability and 4 generalization dimensions, revealing distinct capability profiles among models with similar success rates.

ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

ThinkingVLA is a Mixture-of-Transformers VLA model that performs interleaved forward CoT for subgoal and image prediction followed by inverse CoT grounded on the predicted image to generate actions.

Where Should Action Generation Begin? A Learnable Source Prior for Generative Robot Policies

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

LeaP introduces a learnable proprioception-conditioned diagonal Gaussian source prior for generative robot policies, raising average success rates on 15 RoboTwin tasks from baselines by 6.5-25.5 points.

DuoBench: A Reproducible Benchmark for Bimanual Manipulation in Simulation and the Real World

cs.RO · 2026-06-10 · unverdicted · novelty 7.0

DuoBench introduces eleven bimanual manipulation tasks with stage-based evaluation and human datasets to benchmark imitation-learning and vision-language-action policies on dual-arm robots in sim and real settings.

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

cs.CV · 2026-06-07 · unverdicted · novelty 7.0

X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.

Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies

cs.RO · 2026-06-06 · unverdicted · novelty 7.0

Q-VGM introduces value-gradient matching via VGG-Flow to improve flow-matching VLA policies with a Cal-QL critic, achieving success rate lifts on LIBERO, RoboTwin, and real-robot tasks.

ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

cs.RO · 2026-05-18 · unverdicted · novelty 7.0

ManiSoft is a new benchmark featuring a soft-body simulator, four deformable control tasks, and an automated pipeline generating 6300 scenes with expert trajectories for training and evaluating vision-language policies on continuum robots.

DSSP: Diffusion State Space Policy with Full-History Encoding

cs.RO · 2026-05-14 · conditional · novelty 7.0

DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size in robot manipulation.

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

cs.CV · 2026-05-14 · conditional · novelty 7.0

CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.

RotVLA: Rotational Latent Action for Vision-Language-Action Model

cs.RO · 2026-05-13 · unverdicted · novelty 7.0

RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model

cs.RO · 2026-05-12 · conditional · novelty 7.0

GridS is a plug-and-play differentiable module for geometry-aware visual token resampling in VLA models that achieves under 10% token retention and 76% FLOPs reduction with no success-rate loss.

RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.

CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational overhead during adaptation.

VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

cs.RO · 2026-05-11 · unverdicted · novelty 7.0

VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.

NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

cs.RO · 2026-05-08 · unverdicted · novelty 7.0

NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

cs.RO · 2026-05-02 · unverdicted · novelty 7.0 · 4 refs

HDP3 is a pocket-scale 3D diffusion policy with a Diffusion Mixer decoder that achieves state-of-the-art visuomotor control using two-step DDIM inference and under 1% of the parameters of prior 3D diffusion policies.

VUDA: Breaking CUDA-Vulkan Isolation for Spatial Sharing of Compute and Graphics on the Same GPU

cs.OS · 2026-05-02 · unverdicted · novelty 7.0

VUDA enables spatial sharing between CUDA and Vulkan on GPUs via channel redirection and page-table grafting, achieving up to 85% higher throughput than temporal baselines in embodied AI tasks.

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

HiPolicy: Hierarchical Multi-Frequency Action Chunking for Policy Learning

cs.RO · 2026-04-07 · unverdicted · novelty 7.0

HiPolicy is a new hierarchical multi-frequency action chunking method for imitation learning that jointly generates coarse and fine action sequences with entropy-guided execution to improve performance and efficiency in robotic manipulation.

citing papers explorer

Showing 50 of 167 citing papers after filters.

SpatialBench: Is Your Spatial Foundation Model an All-Round Player? cs.CV · 2026-05-26 · unverdicted · none · ref 14 · internal anchor
SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.
Labimus: A Simulation and Benchmark for Humanoid Dexterous Manipulation in Chemical Laboratory cs.RO · 2026-06-30 · unverdicted · none · ref 20 · internal anchor
Labimus is the first benchmark for humanoid dexterous manipulation in organic chemistry laboratories, exposing a gap between task completion and required experimental precision.
ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models cs.RO · 2026-06-25 · unverdicted · none · ref 36 · 2 links · internal anchor
ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.
LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models cs.RO · 2026-06-22 · unverdicted · none · ref 11 · 2 links · internal anchor
LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi
EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies cs.RO · 2026-06-16 · unverdicted · none · ref 4 · internal anchor
EBench is a benchmark that evaluates generalist mobile manipulation policies on 26 tasks across 5 capability and 4 generalization dimensions, revealing distinct capability profiles among models with similar success rates.
ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation cs.RO · 2026-06-16 · unverdicted · none · ref 9 · internal anchor
ThinkingVLA is a Mixture-of-Transformers VLA model that performs interleaved forward CoT for subgoal and image prediction followed by inverse CoT grounded on the predicted image to generate actions.
Where Should Action Generation Begin? A Learnable Source Prior for Generative Robot Policies cs.RO · 2026-06-16 · unverdicted · none · ref 13 · internal anchor
LeaP introduces a learnable proprioception-conditioned diagonal Gaussian source prior for generative robot policies, raising average success rates on 15 RoboTwin tasks from baselines by 6.5-25.5 points.
DuoBench: A Reproducible Benchmark for Bimanual Manipulation in Simulation and the Real World cs.RO · 2026-06-10 · unverdicted · none · ref 23 · internal anchor
DuoBench introduces eleven bimanual manipulation tasks with stage-based evaluation and human datasets to benchmark imitation-learning and vision-language-action policies on dual-arm robots in sim and real settings.
X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining cs.CV · 2026-06-07 · unverdicted · none · ref 45 · internal anchor
X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.
Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies cs.RO · 2026-06-06 · unverdicted · none · ref 32 · internal anchor
Q-VGM introduces value-gradient matching via VGG-Flow to improve flow-matching VLA policies with a Cal-QL critic, achieving success rate lifts on LIBERO, RoboTwin, and real-robot tasks.
ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics cs.RO · 2026-05-18 · unverdicted · none · ref 3 · internal anchor
ManiSoft is a new benchmark featuring a soft-body simulator, four deformable control tasks, and an automated pipeline generating 6300 scenes with expert trajectories for training and evaluating vision-language policies on continuum robots.
DSSP: Diffusion State Space Policy with Full-History Encoding cs.RO · 2026-05-14 · conditional · none · ref 6 · internal anchor
DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size in robot manipulation.
CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL cs.CV · 2026-05-14 · conditional · none · ref 7 · internal anchor
CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.
RotVLA: Rotational Latent Action for Vision-Language-Action Model cs.RO · 2026-05-13 · unverdicted · none · ref 13 · internal anchor
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation cs.RO · 2026-05-12 · unverdicted · none · ref 11 · internal anchor
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model cs.RO · 2026-05-12 · conditional · none · ref 47 · internal anchor
GridS is a plug-and-play differentiable module for geometry-aware visual token resampling in VLA models that achieves under 10% token retention and 76% FLOPs reduction with no success-rate loss.
RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning cs.RO · 2026-05-12 · unverdicted · none · ref 11 · internal anchor
RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.
CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models cs.CV · 2026-05-11 · unverdicted · none · ref 7 · internal anchor
Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational overhead during adaptation.
VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 8 · internal anchor
VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models cs.RO · 2026-05-08 · unverdicted · none · ref 41 · internal anchor
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control cs.RO · 2026-05-02 · unverdicted · none · ref 3 · 4 links · internal anchor
HDP3 is a pocket-scale 3D diffusion policy with a Diffusion Mixer decoder that achieves state-of-the-art visuomotor control using two-step DDIM inference and under 1% of the parameters of prior 3D diffusion policies.
VUDA: Breaking CUDA-Vulkan Isolation for Spatial Sharing of Compute and Graphics on the Same GPU cs.OS · 2026-05-02 · unverdicted · none · ref 8 · internal anchor
VUDA enables spatial sharing between CUDA and Vulkan on GPUs via channel redirection and page-table grafting, achieving up to 85% higher throughput than temporal baselines in embodied AI tasks.
Being-H0.7: A Latent World-Action Model from Egocentric Videos cs.RO · 2026-04-30 · unverdicted · none · ref 122 · internal anchor
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
HiPolicy: Hierarchical Multi-Frequency Action Chunking for Policy Learning cs.RO · 2026-04-07 · unverdicted · none · ref 3 · internal anchor
HiPolicy is a new hierarchical multi-frequency action chunking method for imitation learning that jointly generates coarse and fine action sequences with entropy-guided execution to improve performance and efficiency in robotic manipulation.
BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination cs.RO · 2026-04-07 · conditional · none · ref 7 · internal anchor
BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.
JailWAM: Jailbreaking World Action Models in Robot Control cs.RO · 2026-04-07 · unverdicted · none · ref 3 · internal anchor
JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs cs.RO · 2026-02-09 · unverdicted · none · ref 25 · internal anchor
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training cs.AI · 2026-02-05 · unverdicted · none · ref 3 · internal anchor
RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.
Information Filtering via Variational Regularization for Robot Manipulation cs.RO · 2026-01-29 · unverdicted · none · ref 4 · internal anchor
Variational Regularization imposes an adaptive information bottleneck on noisy intermediate features in DP3-UNet and DP3-DiT policies, consistently raising task success rates on RoboTwin2.0, Adroit, and MetaWorld while achieving new state-of-the-art results.
TouchGuide: Inference-Time Steering of Visuomotor Policies via Touch Guidance cs.RO · 2026-01-28 · unverdicted · none · ref 10 · internal anchor
TouchGuide improves contact-rich robot manipulation by steering diffusion or flow-matching visuomotor policies with tactile feasibility scores from a contrastively trained Contact Physical Model.
ABot-M0.5: Unified Mobility-and-Manipulation World Action Model cs.CV · 2026-07-01 · unverdicted · none · ref 9 · internal anchor
ABot-M0.5 proposes a unified mobility-and-manipulation world action model using three alignment strategies that achieves state-of-the-art performance on mobile and fine-grained manipulation benchmarks.
Training Vision-Language-Action Models with Dense Embodied Chain-of-Thought Supervision cs.RO · 2026-06-29 · unverdicted · none · ref 13 · 2 links · internal anchor
ZR-0 is a dual-stream VLA model trained with dense ECoT supervision on 60M frames from 400K trajectories to enable cross-embodiment transfer in simulation and real-world settings.
SA-VLA: State-aware tokenizer for improving Vision-Language-Action Models' performance cs.RO · 2026-06-29 · unverdicted · none · ref 28 · internal anchor
SA-VLA adds state conditioning to VQ-based action tokenization in VLA policies, expanding each discrete token's effective support to state-dependent actions and raising average success rates from 0.29 to 0.56 on 12 sim tasks and 0.15 to 0.33 on 3 real tasks.
Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models cs.RO · 2026-06-29 · unverdicted · none · ref 9 · internal anchor
T^2VLA is a test-time reinforcement learning framework for VLAs that uses internal confidence to define intrinsic rewards via similarity to high-confidence expert demonstrations and a dual-expert bootstrapping mechanism.
The Speedup Paradox: Rethinking Inference Speed-Quality Trade-off in Embodied Tasks cs.RO · 2026-06-26 · unverdicted · none · ref 56 · 2 links · internal anchor
TISED decomposes inference optimization effects on embodied tasks and identifies paradoxical outcomes where faster per-step inference can increase task completion time on static tasks or raise success rates on dynamic tasks.
Improving Vision-Language-Action Model Fine-Tuning with Structured Stage and Keyframe Supervision cs.RO · 2026-06-25 · unverdicted · none · ref 15 · internal anchor
StaKe adds lightweight auxiliary heads for manipulation stage identification and next-gripper-transition keyframe prediction to VLA fine-tuning, reporting relative success rate gains of 14% in bimanual simulation and 56% on single-arm real-robot tasks.
G$^3$VLA: Geometric inductive bias for Vision-Language-Action Models cs.RO · 2026-06-23 · unverdicted · none · ref 26 · internal anchor
G³VLA injects calibrated camera geometry into VLA visual tokens via intrinsic-conditioned ray embeddings, PRoPE, and bidirectional cross-view fusion, producing consistent gains on LIBERO, RoboCasa24, RoboTwin2.0, and real-robot tasks when added to π₀.
Distilling Collaborative Dynamics into Latent Space for Implicit Coordination in Decentralized Multi-Agent Manipulation cs.RO · 2026-06-22 · unverdicted · none · ref 42 · 2 links · internal anchor
CLS-DP distills privileged multi-agent dynamics into a collaborative latent space that each agent infers from local RGB observations to condition diffusion-based actions, achieving 38% mean success on six RoboFactory tasks versus 20% for the best centralized baseline.
Decoupling the Declarative from the Procedural in Vision-Language-Action Models cs.RO · 2026-06-19 · unverdicted · none · ref 13 · internal anchor
w²VLA restructures VLA information flow to decouple declarative semantics from procedural skills, enabling zero-shot transfer to novel objects.
Co-VLA: Coordination-Aware Structured Action Modeling for Dual-Arm Vision-Language-Action Systems cs.RO · 2026-06-18 · unverdicted · none · ref 12 · internal anchor
Co-VLA replaces the monolithic action head in VLA models with a coordination-aware Structured Action Expert and Latent-Aware Controller, reporting 27% gains on tight bimanual tasks and doubled OOD performance.
EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies cs.CV · 2026-06-18 · unverdicted · none · ref 35 · 2 links · internal anchor
EventVLA introduces foundational visual anchors and a Keyframe Evidence Memory module that predicts future keyframe probabilities from VLA embeddings to improve long-horizon task success by an average of 40% on 17 simulation and 4 real-world tasks.
ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing? cs.CV · 2026-06-17 · unverdicted · none · ref 90 · internal anchor
ImageWAM shows image editing models can replace video generation in world action models, delivering better performance with 6x lower FLOPs and 4x lower latency by using edit-derived KV caches as compact context.
Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos cs.CV · 2026-06-17 · unverdicted · none · ref 27 · 2 links · internal anchor
A Hybrid Disentangled VQ-VAE with physical masks creates a cross-embodiment action codebook from human videos, allowing VLA pre-training that adapts to new embodiments with only 50 trajectories.
AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation cs.RO · 2026-06-16 · unverdicted · none · ref 27 · internal anchor
AnnotateAnything converts passive 3D assets into manipulation-ready assets by combining vision-language reasoning for semantics with parallel physics pipelines for executable action annotations such as grasps and articulations.
ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining cs.RO · 2026-06-15 · unverdicted · none · ref 55 · internal anchor
ACE-Ego-0 is a VLA pretraining framework that turns egocentric human videos into robot-format pseudo-actions via a video-to-action pipeline and trains jointly with robot data under a reliability-aware objective.
RepWAM: World Action Modeling with Representation Visual-Action Tokenizers cs.CV · 2026-06-11 · unverdicted · none · ref 8 · internal anchor
RepWAM introduces representation visual-action tokenizers to pretrain world action models that jointly model future visual states and latent actions under instructions for improved robot manipulation.
MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models cs.CV · 2026-06-11 · unverdicted · none · ref 12 · internal anchor
MaskWAM unifies mask prompting and prediction in world-action models via Mixture of Transformers to improve robotic policy generalization on language-ambiguous tasks.
See Selectively, Act Adaptively: Dual-Level Structural Decomposition for Bimanual Robot Manipulation cs.RO · 2026-06-11 · unverdicted · none · ref 41 · internal anchor
A VLA policy using view-selective visual routing and interaction-aware action MoE improves average success by 27.7% in simulation and 43.3% in real-world bimanual tasks over monolithic baselines.
An Embodied Simulation Platform, Benchmark, and Data-Efficient Augmentation Framework for Wet-Lab Robotics cs.RO · 2026-06-11 · unverdicted · none · ref 29 · internal anchor
Pipette supplies an open wet-lab simulation platform, 11-task benchmark, and perturbation-based augmentation pipeline that raises VLA success rates on sample handling and device tasks from limited demonstrations.
APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies cs.RO · 2026-06-10 · unverdicted · none · ref 30 · internal anchor
APT pretrains the action expert as a vision-action prior on frozen VLM features then adds language through gated fusion to improve OOD instruction generalization in continuous-action VLA policies.
