AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation
Pith reviewed 2026-05-19 03:48 UTC · model grok-4.3
The pith
AnyPos raises bimanual manipulation success rates by 30-40% using independent task-agnostic image-action pairs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AnyPos integrates large-scale automated task-agnostic exploration to generate diverse safe trajectories with robust embodiment modeling via inverse dynamics learning. It decouples arm and end-effector motions and uses a direction-aware decoder to stabilize predictions under distribution shift. The resulting representations couple directly with diverse high-level policies, producing a 51% improvement in test accuracy and 30-40% higher success rates on tasks including operating a microwave, toasting bread, folding clothes, watering plants, and scrubbing plates.
What carries the argument
Task-agnostic embodiment modeling via inverse dynamics learning on independent image-action pairs, with arm and end-effector decoupling plus a direction-aware decoder.
If this is right
- Action data becomes reusable across tasks because it is no longer tied to sequential task execution.
- Embodiment representations can be learned once and then paired with many different high-level policies.
- Test accuracy improves by 51% over the standard baseline through stabilized inverse dynamics predictions.
- Success rates rise 30-40% on concrete bimanual tasks such as microwave operation and plate scrubbing.
- Cross-platform transfer becomes feasible by keeping embodiment dynamics separate from task logic.
Where Pith is reading between the lines
- Exploration data collected once on one robot body could support policy learning for similar but not identical arms.
- The same pipeline might reduce reliance on human demonstrations when extending to mobile manipulators.
- Direction-aware decoding could be tested for stability under real-world lighting or camera changes.
- Pairing the learned representations with model-based planners might extend performance to longer sequences.
Load-bearing premise
Independent image-action pairs generated by automated task-agnostic exploration can sufficiently cover the full embodiment workspace and capture all physically feasible actions without sequential task-specific data.
What would settle it
Applying AnyPos to a new bimanual task outside the exploration coverage and finding that success rates drop to match or fall below strong baselines.
Figures
read the original abstract
Learning generalizable manipulation policies hinges on data, yet robot manipulation data is scarce and often entangled with specific embodiments, making both cross-task and cross-platform transfer difficult. We tackle this challenge with task-agnostic embodiment modeling, which learns embodiment dynamics directly from task-agnostic action data and decouples them from high-level policy learning. By focusing on exploring all feasible actions of the embodiment to capture what is physically feasible and consistent, task-agnostic data takes the form of independent image-action pairs with the potential to cover the entire embodiment workspace, unlike task-specific data, which is sequential and tied to concrete tasks. This data-driven perspective bypasses the limitations of traditional dynamics-based modeling and enables scalable reuse of action data across different tasks. Building on this principle, we introduce AnyPos, a unified pipeline that integrates large-scale automated task-agnostic exploration with robust embodiment modeling through inverse dynamics learning. AnyPos generates diverse yet safe trajectories at scale, then learns embodiment representations by decoupling arm and end-effector motions and employing a direction-aware decoder to stabilize predictions under distribution shift, which can be seamlessly coupled with diverse high-level policy models. In comparison to the standard baseline, AnyPos achieves a 51% improvement in test accuracy. On manipulation tasks such as operating a microwave, toasting bread, folding clothes, watering plants, and scrubbing plates, AnyPos raises success rates by 30-40% over strong baselines. These results highlight data-driven embodiment modeling as a practical route to overcoming data scarcity and achieving generalization across tasks and platforms in visuomotor control. Project page: https://embodiedfoundation.github.io/vidar_anypos.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AnyPos, a unified pipeline that performs large-scale automated task-agnostic exploration to collect independent image-action pairs, then learns embodiment dynamics through inverse-dynamics training with arm/end-effector decoupling and a direction-aware decoder. These learned dynamics are coupled with arbitrary high-level policies. The authors claim a 51% improvement in test accuracy over a standard baseline and 30-40% higher success rates than strong baselines on five bimanual manipulation tasks (microwave operation, bread toasting, cloth folding, plant watering, plate scrubbing).
Significance. If the central empirical claims are substantiated, the work offers a practical route to scalable, reusable embodiment modeling that decouples dynamics learning from task-specific sequential data, directly addressing data scarcity in visuomotor control. The explicit separation of exploration from policy learning and the direction-aware decoder are concrete technical contributions that could be adopted across platforms.
major comments (2)
- [Abstract] Abstract: The central premise that independent (non-sequential) image-action pairs 'have the potential to cover the entire embodiment workspace' is load-bearing for the reported 30-40% success-rate gains, yet the manuscript provides no analysis, coverage metric, or ablation demonstrating that such sampling captures inter-arm coordination and temporally extended contact sequences required for bimanual tasks (e.g., one arm stabilizing an object while the other executes a multi-step action).
- [§4] §4 (Experimental evaluation): The reported 51% test-accuracy improvement and 30-40% success-rate gains are presented without baseline implementation details, number of trials, error bars, data-exclusion criteria, or statistical significance tests. These omissions make it impossible to verify that the gains are attributable to the proposed modeling choices rather than implementation differences.
minor comments (2)
- [§3.2] The description of the direction-aware decoder would benefit from an explicit equation showing how direction conditioning is injected into the prediction head.
- [Figure 4] Figure captions for trajectory visualizations should explicitly label which arm is performing which sub-action to illustrate coordination capture.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. Below we respond point-by-point to the major comments and describe the revisions we will incorporate.
read point-by-point responses
-
Referee: [Abstract] The central premise that independent (non-sequential) image-action pairs 'have the potential to cover the entire embodiment workspace' is load-bearing for the reported 30-40% success-rate gains, yet the manuscript provides no analysis, coverage metric, or ablation demonstrating that such sampling captures inter-arm coordination and temporally extended contact sequences required for bimanual tasks (e.g., one arm stabilizing an object while the other executes a multi-step action).
Authors: We agree that an explicit analysis of workspace coverage would strengthen the presentation. The automated exploration samples actions independently across the full joint ranges of both arms while enforcing safety constraints, thereby generating configurations that include one arm in a stabilizing or contact-maintaining pose while the other executes motion. The strong empirical performance on coordination-heavy tasks such as cloth folding and plate scrubbing provides supporting evidence that the learned dynamics support these behaviors. In the revision we will add a dedicated analysis subsection that reports coverage metrics, including histograms of relative arm positions and observed contact durations in the collected dataset. revision: yes
-
Referee: [§4] The reported 51% test-accuracy improvement and 30-40% success-rate gains are presented without baseline implementation details, number of trials, error bars, data-exclusion criteria, or statistical significance tests. These omissions make it impossible to verify that the gains are attributable to the proposed modeling choices rather than implementation differences.
Authors: We acknowledge that the current experimental section lacks sufficient detail for full reproducibility and verification. In the revised manuscript we will expand §4 to include complete descriptions of all baseline implementations, the exact number of evaluation trials performed per task, standard error bars on all reported success rates, any data-exclusion criteria that were applied, and the results of statistical significance tests (paired t-tests) on the observed improvements. revision: yes
Circularity Check
Derivation chain is self-contained; no reductions by construction or self-citation
full rationale
The paper's pipeline starts from automated generation of independent image-action pairs via task-agnostic exploration, trains an inverse-dynamics model with arm/end-effector decoupling and a direction-aware decoder, then couples the resulting embodiment representation to high-level policies. Reported gains (51% test accuracy, 30-40% success-rate lifts) are presented as empirical outcomes on specific manipulation tasks rather than quantities forced by the training fit or by redefinition of inputs. No load-bearing step equates a prediction to its own fitted parameters, renames a known result, or relies on a uniqueness theorem imported from the authors' prior work. The coverage assumption for independent pairs is stated as a premise but is not smuggled in via self-citation or made true by definition; it remains an externally falsifiable modeling choice. The derivation is therefore data-driven and self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Independent image-action pairs from automated exploration can cover the entire embodiment workspace and capture physically feasible actions
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
task-agnostic data takes the form of independent image-action pairs with the potential to cover the entire embodiment workspace, unlike task-specific data, which is sequential
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Arm-Decoupled Estimation ... Direction-Aware Decoder (DAD) ... multi-scale dilated convolutions, deformable convolutions, angle-sensitive pooling
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 8 Pith papers
-
CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...
-
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
-
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
-
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
-
Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?
Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.
-
Ctrl-World: A Controllable Generative World Model for Robot Manipulation
A controllable world model trained on the DROID dataset generates consistent multi-view robot trajectories for over 20 seconds and improves generalist policy success rates by 44.7% via imagined trajectory fine-tuning.
-
Vidar: Embodied Video Diffusion Model for Generalist Manipulation
Vidar shows that a video diffusion prior continuously pre-trained on 750K multi-view robot trajectories plus a label-free masked inverse dynamics adapter can generalize manipulation to new robot embodiments with 1% of...
-
StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...
Reference graph
Works this paper leans on
-
[1]
AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, Shu Jiang, Yuxin Jiang, Cheng Jing, Hongyang Li, Jialu Li, Chiming Liu, Yi Liu, Yuxiang Lu, Jianlan Luo, Ping Luo, Yao Mu, Yuehan Niu, Yixuan Pan, Jiangmiao Pang, Yu Qiao, Guanghui Ren, Cheng Ruan, Jiaqi Shan, Yongjian...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models
Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. CoRR, abs/2405.04233, 2024
-
[3]
Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation
Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation. CoRR, abs/2409.16283, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. A vision-language action flow model for general robot control. arXiv preprint arXiv:2410.24164, 3(6), 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. CoRR, abs/2310.10639, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[7]
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
Chilam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation. CoRR, abs/2410.06158, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
Diffusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Robotics: Science and Systems, 2023
work page 2023
-
[9]
Deformable convolutional networks
Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017
work page 2017
-
[10]
Vision transformers need registers
Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024
work page 2024
-
[11]
Pengxiang Ding, Jianfei Ma, Xinyang Tong, Binghong Zou, Xinxin Luo, Yiguo Fan, Ting Wang, Hongchao Lu, Panzhong Mo, Jinxin Liu, et al. Humanoid-vla: Towards universal humanoid control with visual integration. arXiv preprint arXiv:2502.14795, 2025
-
[12]
Learning universal policies via text-guided video generation
Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Info...
work page 2023
-
[13]
Bridge data: Boosting generalization of robotic skills with cross-domain datasets
Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Dani- ilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. In Kris Hauser, Dylan A. Shell, and Shoudong Huang, editors, Robotics: Science and Systems XVIII, New York City, NY, USA, June 27 - J...
work page 2022
-
[14]
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, et al. Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning. arXiv preprint arXiv:2504.18904, 2025. 10
-
[16]
Sanketi, Dorsa Sadigh, Chelsea Finn, and Sergey Levine
Dibya Ghosh, Homer Rich Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Quan Vuong, Ted Xiao, Pannag R. Sanketi, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Dana Kulic, Gentiane Venture, Kostas E. Bekr...
work page 2024
-
[17]
Maniskill2: A unified benchmark for generalizable manipulation skills
Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, et al. Maniskill2: A unified benchmark for generalizable manipulation skills. In The Eleventh International Conference on Learning Representations
-
[18]
Accelerate: Training and inference at scale made simple, efficient and adaptable
Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate, 2022
work page 2022
-
[19]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016
work page 2016
-
[20]
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. CoRR, abs/2412.14803, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, You...
work page 2024
-
[22]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model. CoRR, abs/2503.00200, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning
Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[28]
Orbit: A unified simulation framework for interactive robot learning environments
Mayank Mittal, Calvin Yu, Qinxi Yu, Jingzhou Liu, Nikita Rudin, David Hoeller, Jia Lin Yuan, Ritvik Singh, Yunrong Guo, Hammad Mazhar, et al. Orbit: A unified simulation framework for interactive robot learning environments. IEEE Robotics and Automation Letters, 8(6):3740–3747, 2023. 11
work page 2023
-
[29]
Yao Mu, Tianxing Chen, Shijia Peng, Zanxin Chen, Zeyu Gao, Yude Zou, Lunkai Lin, Zhiqiang Xie, and Ping Luo. Robotwin: Dual-arm robot benchmark with generative digital twins (early version). CoRR, abs/2409.02920, 2024
-
[30]
Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alexander Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew E. Wang, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, Anthony Brohan, Antonin Raf...
work page 2024
-
[31]
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick L...
work page 2024
-
[32]
Pytorch: An imperative style, high-performance deep learning library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019
work page 2019
-
[33]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
Diffusion Policy Policy Optimization
Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588, 2024. 12
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[35]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[36]
Manibox: Enhancing spatial grasping generalization via scalable simulation data generation
Hengkai Tan, Xuezhou Xu, Chengyang Ying, Xinyi Mao, Songming Liu, Xingxing Zhang, Hang Su, and Jun Zhu. Manibox: Enhancing spatial grasping generalization via scalable simulation data generation. CoRR, abs/2411.01850, 2024
-
[37]
Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation
Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation. CoRR, abs/2412.15109, 2024
work page internal anchor Pith review arXiv 2024
-
[38]
RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation
Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint arXiv:2412.13877, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[39]
Learning interactive real-world simulators
Sherry Yang, Yilun Du, Seyed Kamyar Seyed Ghasemipour, Jonathan Tompson, Leslie Pack Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. In The Twelfth In- ternational Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024
work page 2024
-
[40]
3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations
Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. In ICRA 2024 Workshop on 3D Visual Representations for Robot Manipulation, 2024
work page 2024
-
[41]
Jiyao Zhang, Mingjie Pan, Baifeng Xie, Yinghao Zhao, Wenlong Gao, Guangte Xiang, Jiawei Zhang, Dong Li, Zhijun Li, Sheng Zhang, Hongwei Fan, Chengyue Zhao, Shukai Yang, Maoqing Yao, Chuanzhe Suo, and Hao Dong. Agibot digitalworld. https://agibot-digitalworld.com/, 2025
work page 2025
-
[42]
Robodreamer: Learning compositional world models for robot imagination
Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. Robodreamer: Learning compositional world models for robot imagination. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024
work page 2024
-
[43]
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023. 13 A More Results A.1 Demonstration of Cross-Arm Interference To investigate potentia...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.