
arxiv: 2507.12768 · v2 · submitted 2025-07-17 · 💻 cs.CV · cs.LG · cs.RO

AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation

Pith reviewed 2026-05-19 03:48 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · cs.RO
keywords bimanual manipulation · task-agnostic exploration · embodiment modeling · inverse dynamics · robot learning · visuomotor control · data reuse · generalization

The pith

AnyPos raises bimanual manipulation success rates by 30-40% using independent task-agnostic image-action pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that robot manipulation policies generalize better when embodiment dynamics are learned first from large-scale task-agnostic data, in the form of independent image-action pairs generated by automated exploration. Unlike pipelines built on sequential, task-specific data, this approach decouples low-level modeling of physical feasibility from high-level task-specific policy learning. If correct, it would allow the same action data to be reused across many tasks and different robot platforms while improving accuracy and success rates. A sympathetic reader would care because robot data collection is expensive and embodiment-specific, so scalable reuse could ease the data bottleneck in visuomotor learning.

Core claim

AnyPos combines large-scale automated task-agnostic exploration, which generates diverse yet safe trajectories, with robust embodiment modeling via inverse dynamics learning. It decouples arm and end-effector motions and uses a direction-aware decoder to stabilize predictions under distribution shift. The resulting representations couple directly with diverse high-level policies, producing a 51% improvement in test accuracy and 30-40% higher success rates on tasks including operating a microwave, toasting bread, folding clothes, watering plants, and scrubbing plates.

What carries the argument

Task-agnostic embodiment modeling via inverse dynamics learning on independent image-action pairs, with arm and end-effector decoupling plus a direction-aware decoder.
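
To make the idea of independent, decoupled action targets concrete, here is a minimal sketch. It is not the paper's exploration procedure: the joint limits, the left-then-right ordering, and the 6-joints-plus-1-gripper layout per side are assumptions chosen to match the 14 qpos dimensions and the 6-DOF arms mentioned in the figure captions. Safety filtering and the camera image that would be paired with each target are omitted.

```python
import random
from typing import Dict, List, Tuple

ARM_DOF = 6  # per-arm joint count, as stated in the Figure 3 caption
# Hypothetical joint limits in radians; real limits depend on the hardware.
ARM_LIMITS: List[Tuple[float, float]] = [(-1.5, 1.5)] * ARM_DOF
GRIPPER_LIMITS: Tuple[float, float] = (0.0, 1.0)  # hypothetical opening range


def sample_side(rng: random.Random) -> List[float]:
    """Sample one arm's joint targets followed by its gripper value."""
    arm = [rng.uniform(lo, hi) for lo, hi in ARM_LIMITS]
    return arm + [rng.uniform(*GRIPPER_LIMITS)]


def sample_qpos(rng: random.Random) -> List[float]:
    """Sample an independent 14-dim target: left side, then right side."""
    return sample_side(rng) + sample_side(rng)


def split_qpos(qpos: List[float]) -> Dict[str, List[float]]:
    """Decouple a 14-dim target into arm and gripper parts per side."""
    side = ARM_DOF + 1
    left, right = qpos[:side], qpos[side:]
    return {
        "left_arm": left[:ARM_DOF],
        "left_gripper": left[ARM_DOF:],
        "right_arm": right[:ARM_DOF],
        "right_gripper": right[ARM_DOF:],
    }


rng = random.Random(0)
# Each target is drawn on its own; no target depends on the previous one.
targets = [sample_qpos(rng) for _ in range(3)]
parts = split_qpos(targets[0])
print(len(targets[0]), sorted(parts))
```

Because every target is drawn independently, the resulting pairs carry no task ordering, which is the property the summary above contrasts with sequential task data.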

If this is right

  • Action data becomes reusable across tasks because it is no longer tied to sequential task execution.
  • Embodiment representations can be learned once and then paired with many different high-level policies.
  • Test accuracy improves by 51% over the standard baseline through stabilized inverse dynamics predictions.
  • Success rates rise 30-40% on concrete bimanual tasks such as microwave operation and plate scrubbing.
  • Cross-platform transfer becomes feasible by keeping embodiment dynamics separate from task logic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Exploration data collected once on one robot body could support policy learning for similar but not identical arms.
  • The same pipeline might reduce reliance on human demonstrations when extending to mobile manipulators.
  • Direction-aware decoding could be tested for stability under real-world lighting or camera changes.
  • Pairing the learned representations with model-based planners might extend performance to longer sequences.

Load-bearing premise

Independent image-action pairs generated by automated task-agnostic exploration can sufficiently cover the full embodiment workspace and capture all physically feasible actions without sequential task-specific data.

What would settle it

Applying AnyPos to a new bimanual task outside the exploration coverage and finding that success rates drop to match or fall below strong baselines.
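
Deciding whether a task lies "outside the exploration coverage" needs some measure of coverage. The following is one possible sketch, assuming joint positions normalized to a common range and a simple per-dimension histogram; the function names, bin count, and data are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Set


def bin_index(value: float, lo: float, hi: float, bins: int) -> int:
    """Map a value in [lo, hi] to a histogram bin; return -1 if out of range."""
    if not lo <= value <= hi:
        return -1
    return min(int((value - lo) / (hi - lo) * bins), bins - 1)


def visited_bins(samples: Sequence[Sequence[float]], dim: int,
                 lo: float, hi: float, bins: int) -> Set[int]:
    """Bins of one qpos dimension that the exploration data touched."""
    return {bin_index(s[dim], lo, hi, bins) for s in samples} - {-1}


def uncovered_fraction(explore: Sequence[Sequence[float]],
                       test: Sequence[Sequence[float]],
                       lo: float, hi: float, bins: int) -> float:
    """Share of (test sample, dimension) entries in bins never explored."""
    dims = len(test[0])
    misses = 0
    for d in range(dims):
        seen = visited_bins(explore, d, lo, hi, bins)
        misses += sum(bin_index(t[d], lo, hi, bins) not in seen for t in test)
    return misses / (dims * len(test))


# Hypothetical 2-dim data, for illustration only.
explore = [[0.12, 0.95], [0.55, 0.55]]
test = [[0.15, 0.15]]
print(uncovered_fraction(explore, test, 0.0, 1.0, 10))
```

A per-dimension measure like this only checks marginal coverage. It cannot show that joint configurations of both arms were visited together, so a low value would be necessary but not sufficient evidence for the load-bearing premise.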

Figures

Figures reproduced from arXiv: 2507.12768 by Guodong Liu, Hang Su, Hengkai Tan, Jun Zhu, Shuhe Huang, Xinyi Mao, Yao Feng, Zhongkai Hao.

Figure 1: Overview of AnyPos. Our efficient auto-collected task-agnostic action collection method combines AnyPos training, achieving state-of-the-art accuracy and generalizability of image-to-action regression to unseen tasks.
Figure 2: AnyPos illustration. We obtain a task-agnostic training dataset covering the entire cubic workspace of dual robotic arms using ATARA. Input to AnyPos: An image containing the robotic arms. Output of AnyPos: The action/joint position values inferred from the image.
Figure 3: The schematic of the dual-arm setup. The red box is added manually and is not a model input. The bottom-left/right subfigures display the left/right grippers. The top subfigure depicts the two lightweight 6-DOF robotic arms, each comprising 2 base joints, 1 elbow joint, and 3 high-precision wrist joints.
Figure 4: The results of AnyPos with video replay to accomplish various manipulation tasks.
Figure 5: The results of AnyPos collaborating with video generation models to accomplish various …
Figure 6: Attention heatmap of the input image. Here we only estimate the qpos of the left arm, but there is a clear attention focus on the right arm, demonstrating that the model cannot fully disentangle the two arms during inference.
Figure 7: Qpos distribution of task-agnostic random actions and test dataset. The figure calculates the frequency distribution of qpos in 14 dimensions. We show that random actions can cover all the possible qpos in each dimension. Note that the volume of ATARA's data significantly exceeds that of the test dataset.
Figure 8: The results of AnyPos collaborating with video replay to accomplish various manipulation …
Figure 9: The results of baseline (ResNet+MLP) collaborating with video replay to accomplish …
Figure 10: The results of AnyPos-Human (trained with human-collected data) collaborating with …
Figure 11: Sampled results of AnyPos collaborating with generated video to accomplish various …
Figure 12: Hardware features
Original abstract

Learning generalizable manipulation policies hinges on data, yet robot manipulation data is scarce and often entangled with specific embodiments, making both cross-task and cross-platform transfer difficult. We tackle this challenge with task-agnostic embodiment modeling, which learns embodiment dynamics directly from task-agnostic action data and decouples them from high-level policy learning. By focusing on exploring all feasible actions of the embodiment to capture what is physically feasible and consistent, task-agnostic data takes the form of independent image-action pairs with the potential to cover the entire embodiment workspace, unlike task-specific data, which is sequential and tied to concrete tasks. This data-driven perspective bypasses the limitations of traditional dynamics-based modeling and enables scalable reuse of action data across different tasks. Building on this principle, we introduce AnyPos, a unified pipeline that integrates large-scale automated task-agnostic exploration with robust embodiment modeling through inverse dynamics learning. AnyPos generates diverse yet safe trajectories at scale, then learns embodiment representations by decoupling arm and end-effector motions and employing a direction-aware decoder to stabilize predictions under distribution shift, which can be seamlessly coupled with diverse high-level policy models. In comparison to the standard baseline, AnyPos achieves a 51% improvement in test accuracy. On manipulation tasks such as operating a microwave, toasting bread, folding clothes, watering plants, and scrubbing plates, AnyPos raises success rates by 30-40% over strong baselines. These results highlight data-driven embodiment modeling as a practical route to overcoming data scarcity and achieving generalization across tasks and platforms in visuomotor control. Project page: https://embodiedfoundation.github.io/vidar_anypos.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AnyPos, a unified pipeline that performs large-scale automated task-agnostic exploration to collect independent image-action pairs, then learns embodiment dynamics through inverse-dynamics training with arm/end-effector decoupling and a direction-aware decoder. These learned dynamics are coupled with arbitrary high-level policies. The authors claim a 51% improvement in test accuracy over a standard baseline and 30-40% higher success rates than strong baselines on five bimanual manipulation tasks (microwave operation, bread toasting, cloth folding, plant watering, plate scrubbing).

Significance. If the central empirical claims are substantiated, the work offers a practical route to scalable, reusable embodiment modeling that decouples dynamics learning from task-specific sequential data, directly addressing data scarcity in visuomotor control. The explicit separation of exploration from policy learning and the direction-aware decoder are concrete technical contributions that could be adopted across platforms.

major comments (2)
  1. [Abstract] The central premise that independent (non-sequential) image-action pairs 'have the potential to cover the entire embodiment workspace' is load-bearing for the reported 30-40% success-rate gains, yet the manuscript provides no analysis, coverage metric, or ablation demonstrating that such sampling captures inter-arm coordination and temporally extended contact sequences required for bimanual tasks (e.g., one arm stabilizing an object while the other executes a multi-step action).
  2. [§4, Experimental evaluation] The reported 51% test-accuracy improvement and 30-40% success-rate gains are presented without baseline implementation details, number of trials, error bars, data-exclusion criteria, or statistical significance tests. These omissions make it impossible to verify that the gains are attributable to the proposed modeling choices rather than implementation differences.
minor comments (2)
  1. [§3.2] The description of the direction-aware decoder would benefit from an explicit equation showing how direction conditioning is injected into the prediction head.
  2. [Figure 4] Figure captions for trajectory visualizations should explicitly label which arm is performing which sub-action to illustrate coordination capture.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. Below we respond point-by-point to the major comments and describe the revisions we will incorporate.

  1. Referee: [Abstract] The central premise that independent (non-sequential) image-action pairs 'have the potential to cover the entire embodiment workspace' is load-bearing for the reported 30-40% success-rate gains, yet the manuscript provides no analysis, coverage metric, or ablation demonstrating that such sampling captures inter-arm coordination and temporally extended contact sequences required for bimanual tasks (e.g., one arm stabilizing an object while the other executes a multi-step action).

    Authors: We agree that an explicit analysis of workspace coverage would strengthen the presentation. The automated exploration samples actions independently across the full joint ranges of both arms while enforcing safety constraints, thereby generating configurations that include one arm in a stabilizing or contact-maintaining pose while the other executes motion. The strong empirical performance on coordination-heavy tasks such as cloth folding and plate scrubbing provides supporting evidence that the learned dynamics support these behaviors. In the revision we will add a dedicated analysis subsection that reports coverage metrics, including histograms of relative arm positions and observed contact durations in the collected dataset.

    Revision: yes

  2. Referee: [§4] The reported 51% test-accuracy improvement and 30-40% success-rate gains are presented without baseline implementation details, number of trials, error bars, data-exclusion criteria, or statistical significance tests. These omissions make it impossible to verify that the gains are attributable to the proposed modeling choices rather than implementation differences.

    Authors: We acknowledge that the current experimental section lacks sufficient detail for full reproducibility and verification. In the revised manuscript we will expand §4 to include complete descriptions of all baseline implementations, the exact number of evaluation trials performed per task, standard error bars on all reported success rates, any data-exclusion criteria that were applied, and the results of statistical significance tests (paired t-tests) on the observed improvements.

    Revision: yes
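
The statistics named in this exchange (standard errors on success rates and a paired t-test) can be sketched as follows. This is an illustrative calculation only: the trial counts and per-task success rates are made-up placeholders, not results from the paper, and the helper names are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def success_rate_se(successes: int, trials: int) -> Tuple[float, float]:
    """Success rate and its binomial standard error sqrt(p(1-p)/n)."""
    p = successes / trials
    return p, math.sqrt(p * (1 - p) / trials)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for paired samples: mean(d) / (stdev(d) / sqrt(n))."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


# Placeholder numbers for illustration; NOT taken from the paper.
rate, se = success_rate_se(successes=8, trials=10)
method_rates = [0.8, 0.7, 0.9]    # hypothetical per-task success rates
baseline_rates = [0.5, 0.5, 0.5]  # hypothetical baseline on the same tasks
t_stat = paired_t_statistic(method_rates, baseline_rates)
print(round(rate, 3), round(se, 3), round(t_stat, 3))
```

The t statistic would then be compared against a t distribution with n - 1 degrees of freedom to obtain a p-value. With only a handful of tasks, such a test has little power, which is one reason the number of trials per task matters as much as the test itself.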

Circularity Check

0 steps flagged

Derivation chain is self-contained; no reductions by construction or self-citation


The paper's pipeline starts from automated generation of independent image-action pairs via task-agnostic exploration, trains an inverse-dynamics model with arm/end-effector decoupling and a direction-aware decoder, then couples the resulting embodiment representation to high-level policies. Reported gains (51% test accuracy, 30-40% success-rate lifts) are presented as empirical outcomes on specific manipulation tasks rather than quantities forced by the training fit or by redefinition of inputs. No load-bearing step equates a prediction to its own fitted parameters, renames a known result, or relies on a uniqueness theorem imported from the authors' prior work. The coverage assumption for independent pairs is stated as a premise but is not smuggled in via self-citation or made true by definition; it remains an externally falsifiable modeling choice. The derivation is therefore data-driven and self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that random safe image-action pairs can densely sample the workspace and that decoupling arm and end-effector motions plus direction awareness will stabilize predictions; no explicit free parameters or new invented entities are stated in the abstract.

axioms (1)
  • domain assumption: Independent image-action pairs from automated exploration can cover the entire embodiment workspace and capture physically feasible actions
    Explicitly contrasted with sequential task-specific data in the abstract as the key enabler for scalable reuse.

pith-pipeline@v0.9.0 · 5853 in / 1322 out tokens · 47876 ms · 2026-05-19T03:48:33.430414+00:00 · methodology


Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

    cs.CV · 2026-05 · conditional · novelty 7.0

    CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...

  2. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO · 2026-05 · unverdicted · novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  3. HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

    cs.RO · 2026-05 · unverdicted · novelty 6.0

    HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.

  4. VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

    cs.RO · 2026-04 · unverdicted · novelty 6.0

    VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

  5. Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

    cs.RO · 2026-04 · unverdicted · novelty 6.0

    Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.

  6. Ctrl-World: A Controllable Generative World Model for Robot Manipulation

    cs.RO · 2025-10 · unverdicted · novelty 6.0

    A controllable world model trained on the DROID dataset generates consistent multi-view robot trajectories for over 20 seconds and improves generalist policy success rates by 44.7% via imagined trajectory fine-tuning.

  7. Vidar: Embodied Video Diffusion Model for Generalist Manipulation

    cs.LG · 2025-07 · unverdicted · novelty 6.0

    Vidar shows that a video diffusion prior continuously pre-trained on 750K multi-view robot trajectories plus a label-free masked inverse dynamics adapter can generalize manipulation to new robot embodiments with 1% of...

  8. StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

    cs.RO · 2026-04 · unverdicted · novelty 5.0

    StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 8 Pith papers · 19 internal anchors

  1. [1]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, Shu Jiang, Yuxin Jiang, Cheng Jing, Hongyang Li, Jialu Li, Chiming Liu, Yi Liu, Yuxiang Lu, Jianlan Luo, Ping Luo, Yao Mu, Yuehan Niu, Yixuan Pan, Jiangmiao Pang, Yu Qiao, Guanghui Ren, Cheng Ruan, Jiaqi Shan, Yongjian...

  2. [2]

    Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models

    Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. CoRR, abs/2405.04233, 2024

  3. [3]

    Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

    Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation. CoRR, abs/2409.16283, 2024

  4. [4]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. A vision-language action flow model for general robot control. arXiv preprint arXiv:2410.24164, 3(6), 2024

  5. [5]

    Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

    Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. CoRR, abs/2310.10639, 2023

  6. [6]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  7. [7]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chilam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation. CoRR, abs/2410.06158, 2024

  8. [8]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Robotics: Science and Systems, 2023

  9. [9]

    Deformable convolutional networks

    Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017

  10. [10]

    Vision transformers need registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

  11. [11]

    Humanoid-VLA: Towards universal humanoid control with visual integration

    Pengxiang Ding, Jianfei Ma, Xinyang Tong, Binghong Zou, Xinxin Luo, Yiguo Fan, Ting Wang, Hongchao Lu, Panzhong Mo, Jinxin Liu, et al. Humanoid-vla: Towards universal humanoid control with visual integration. arXiv preprint arXiv:2502.14795, 2025

  12. [12]

    Learning universal policies via text-guided video generation

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Info...

  13. [13]

    Bridge data: Boosting generalization of robotic skills with cross-domain datasets

    Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. In Kris Hauser, Dylan A. Shell, and Shoudong Huang, editors, Robotics: Science and Systems XVIII, New York City, NY, USA, June 27 - J...

  14. [14]

    Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

    Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024

  15. [15]

    Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning

    Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, et al. Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning. arXiv preprint arXiv:2504.18904, 2025.

  16. [16]

    Octo: An open-source generalist robot policy

    Dibya Ghosh, Homer Rich Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Quan Vuong, Ted Xiao, Pannag R. Sanketi, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Dana Kulic, Gentiane Venture, Kostas E. Bekr...

  17. [17]

    Maniskill2: A unified benchmark for generalizable manipulation skills

    Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, et al. Maniskill2: A unified benchmark for generalizable manipulation skills. In The Eleventh International Conference on Learning Representations

  18. [18]

    Accelerate: Training and inference at scale made simple, efficient and adaptable

    Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate, 2022

  19. [19]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  20. [20]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. CoRR, abs/2412.14803, 2024

  21. [21]

    Sanketi, Archit Sharma, Cody Simpson, Quan Vuong, Homer Rich Walke, Blake Wulfe, Ted Xiao, Jonathan Heewon Yang, Arefeh Yavary, Tony Z

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, You...

  22. [22]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  23. [23]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024

  24. [24]

    Unified Video Action Model

    Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model. CoRR, abs/2503.00200, 2025

  25. [25]

    HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

    Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025

  26. [26]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024

  27. [27]

    Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning

    Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021

  28. [28]

    Orbit: A unified simulation framework for interactive robot learning environments

    Mayank Mittal, Calvin Yu, Qinxi Yu, Jingzhou Liu, Nikita Rudin, David Hoeller, Jia Lin Yuan, Ritvik Singh, Yunrong Guo, Hammad Mazhar, et al. Orbit: A unified simulation framework for interactive robot learning environments. IEEE Robotics and Automation Letters, 8(6):3740–3747, 2023.

  29. [29]

    RoboTwin: Dual-arm robot benchmark with generative digital twins

    Yao Mu, Tianxing Chen, Shijia Peng, Zanxin Chen, Zeyu Gao, Yude Zou, Lunkai Lin, Zhiqiang Xie, and Ping Luo. Robotwin: Dual-arm robot benchmark with generative digital twins (early version). CoRR, abs/2409.02920, 2024

  30. [30]

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alexander Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew E. Wang, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, Anthony Brohan, Antonin Raf...

  31. [31]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick L...

  32. [32]

    Pytorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019

  33. [33]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025

  34. [34]

    Diffusion Policy Policy Optimization

    Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588, 2024.

  35. [35]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  36. [36]

    Manibox: Enhancing spatial grasping generalization via scalable simulation data generation

    Hengkai Tan, Xuezhou Xu, Chengyang Ying, Xinyi Mao, Songming Liu, Xingxing Zhang, Hang Su, and Jun Zhu. Manibox: Enhancing spatial grasping generalization via scalable simulation data generation. CoRR, abs/2411.01850, 2024

  37. [37]

    Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation

    Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation. CoRR, abs/2412.15109, 2024

  38. [38]

    RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation

    Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint arXiv:2412.13877, 2024

  39. [39]

    Learning interactive real-world simulators

    Sherry Yang, Yilun Du, Seyed Kamyar Seyed Ghasemipour, Jonathan Tompson, Leslie Pack Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

  40. [40]

    3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations

    Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. In ICRA 2024 Workshop on 3D Visual Representations for Robot Manipulation, 2024

  41. [41]

    Agibot digitalworld

    Jiyao Zhang, Mingjie Pan, Baifeng Xie, Yinghao Zhao, Wenlong Gao, Guangte Xiang, Jiawei Zhang, Dong Li, Zhijun Li, Sheng Zhang, Hongwei Fan, Chengyue Zhao, Shukai Yang, Maoqing Yao, Chuanzhe Suo, and Hao Dong. Agibot digitalworld. https://agibot-digitalworld.com/, 2025

  42. [42]

    Robodreamer: Learning compositional world models for robot imagination

    Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. Robodreamer: Learning compositional world models for robot imagination. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024

  43. [43]

    RT-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.