pith. machine review for the scientific record.

arxiv: 2605.13452 · v1 · submitted 2026-05-13 · 💻 cs.RO · cs.AI

Recognition: 2 theorem links · Lean Theorem

CUBic: Coordinated Unified Bimanual Perception and Control Framework

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:02 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords bimanual manipulation · visuomotor policy · robot coordination · diffusion policy · tokenized representation · codebook coordination · perception aggregation

The pith

CUBic unifies bimanual robot perception and control in a shared tokenized representation where independence and coordination arise from structure alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the difficulty of extending single-arm visuomotor policies to two arms that must act both independently and in coordination. It does so by learning one tokenized representation that bridges visual perception and motor control, using unidirectional aggregation to gather information and bidirectional codebooks to align the arms. This structure is meant to let the necessary independence and coordination appear without separate hand-crafted rules for coupling the arms. A sympathetic reader would care because current methods force a choice between fully decoupling the arms or imposing strong artificial links, both of which limit performance on realistic tasks. Experiments on the RoboTwin benchmark are presented to show gains in coordination accuracy and overall task success over existing baselines.

Core claim

CUBic reformulates bimanual coordination as a unified perceptual modeling problem. It learns a shared tokenized representation that bridges perception and control, so that independence and coordination emerge intrinsically from the structure rather than from hand-crafted coupling mechanisms. The approach combines unidirectional perception aggregation, bidirectional perception coordination through two codebooks that share a mapping, and a single perception-to-control diffusion policy.
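
The shape of this claim can be made concrete. Below is a minimal PyTorch sketch of the three-component pipeline as the review describes it; the class names, dimensions, linear stand-ins for the encoders, and the placeholder policy head are hypothetical, not the paper's implementation (the aggregation and coordination internals are sketched separately further down).

    # Hypothetical skeleton of the three components named in the core claim.
    # Real encoders (ResNet-18 + MLPs), the aggregation/coordination modules, and the
    # diffusion policy head are replaced by stand-ins; only the data flow is illustrated.
    import torch
    import torch.nn as nn

    class CUBicSketch(nn.Module):
        def __init__(self, obs_dim=512, token_dim=256, action_dim=14):
            super().__init__()
            self.encode_left = nn.Linear(obs_dim, token_dim)   # stand-in for per-arm encoder
            self.encode_right = nn.Linear(obs_dim, token_dim)
            self.aggregate = nn.Identity()    # unidirectional perception aggregation
            self.coordinate = nn.Identity()   # bidirectional codebook coordination
            self.policy = nn.Linear(2 * token_dim, action_dim)  # stand-in for diffusion policy

        def forward(self, obs_left, obs_right):
            z_left = self.aggregate(self.encode_left(obs_left))
            z_right = self.aggregate(self.encode_right(obs_right))
            z_left, z_right = self.coordinate((z_left, z_right))  # shared tokenized space
            return self.policy(torch.cat([z_left, z_right], dim=-1))  # one joint action head

    actions = CUBicSketch()(torch.randn(4, 512), torch.randn(4, 512))
    print(actions.shape)  # torch.Size([4, 14])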

What carries the argument

Shared tokenized representation formed by unidirectional perception aggregation and bidirectional coordination via two codebooks with shared mapping.
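
The review does not spell out how the two codebooks and their shared mapping are built, so the sketch below is one plausible reading under explicit assumptions: each arm owns a codebook, quantization is nearest-neighbor with a straight-through gradient, and the shared mapping is a single linear layer applied to both arms' quantized tokens. None of these choices is confirmed by the text quoted here.

    # One plausible reading of "two codebooks with shared mapping" (all names and sizes
    # are assumptions): per-arm codebooks quantize each arm's tokens, and one shared
    # projection places both arms' codes in a common coordination space.
    import torch
    import torch.nn as nn

    class DualCodebookCoordination(nn.Module):
        def __init__(self, dim=256, num_codes=512):
            super().__init__()
            self.codebook_left = nn.Embedding(num_codes, dim)
            self.codebook_right = nn.Embedding(num_codes, dim)
            self.shared_map = nn.Linear(dim, dim)  # the mapping shared by both arms

        @staticmethod
        def quantize(z, codebook):
            # z: (batch, tokens, dim); snap every token to its nearest code vector.
            dists = (z.unsqueeze(-2) - codebook.weight).pow(2).sum(-1)  # (batch, tokens, codes)
            codes = codebook(dists.argmin(dim=-1))
            return z + (codes - z).detach()  # straight-through estimator keeps encoder gradients

        def forward(self, z_left, z_right):
            q_left = self.shared_map(self.quantize(z_left, self.codebook_left))
            q_right = self.shared_map(self.quantize(z_right, self.codebook_right))
            return q_left, q_right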

If this is right

  • Bimanual tasks can be solved with higher coordination accuracy without explicit inter-arm coupling mechanisms.
  • Task success rates rise on the RoboTwin benchmark relative to decoupled or strongly coupled visuomotor baselines.
  • Perception and control are handled inside one diffusion policy that operates on the shared tokenized representation.
  • The same structural design supports both independent arm motion and coordinated interaction without separate modules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The tokenized approach could be tested on multi-robot teams to see whether similar emergence of coordination occurs at larger scale.
  • Removing the need for hand-crafted coupling rules may shorten the engineering effort required to adapt the method to new bimanual tasks.
  • Applying the framework to real physical robots would show whether the benchmark gains persist outside simulation.

Load-bearing premise

The shared tokenized representation produced by unidirectional aggregation and bidirectional codebook coordination will let arm independence and inter-arm coordination both emerge on their own, without any added coupling rules.

What would settle it

Running the same benchmark tasks after removing the bidirectional codebook coordination and finding no drop in coordination accuracy or task success rates compared with the full model.
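
As a sketch of how that test could be scored, the harness below toggles the coordination module and aggregates success over the three-seed protocol the figure captions describe. The make_policy and rollout hooks are hypothetical placeholders for the user's own training and simulation code, not anything from the paper.

    # Hypothetical ablation harness: score the full model and a variant with the
    # bidirectional codebook coordination disabled, reporting mean and standard deviation
    # of success rate over three seeds. The policy factory and rollout function are
    # placeholders supplied by the user.
    import statistics
    from typing import Callable, Sequence

    def ablation_success(
        make_policy: Callable[[bool, int], object],  # (use_coordination, seed) -> policy
        rollout: Callable[[object, int], bool],      # (policy, episode) -> task success
        use_coordination: bool,
        seeds: Sequence[int] = (0, 1, 2),
        episodes: int = 100,
    ) -> tuple[float, float]:
        rates = []
        for seed in seeds:
            policy = make_policy(use_coordination, seed)
            rates.append(sum(rollout(policy, ep) for ep in range(episodes)) / episodes)
        return statistics.mean(rates), statistics.stdev(rates)

    # full = ablation_success(make_policy, rollout, use_coordination=True)
    # ablated = ablation_success(make_policy, rollout, use_coordination=False)
    # The load-bearing premise survives only if `ablated` falls clearly below `full`
    # on the coordination-heavy tasks.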

Figures

Figures reproduced from arXiv: 2605.13452 by Donglin Wang, Jingkai Xu, Pengxiang Ding, Xingyu Wang, Zhaoxin Fan.

Figure 1: Overall Framework. Our proposed unified framework for bimanual perception and control collaboration. The unidirectional perception aggregation module leverages a unified masked attention mechanism to learn bimanual-relevant perceptual information from the top view; the bidirectional perception coordination module employs a dual-codebook shared mapping mechanism to achieve implicit coordination of bimanual …

Figure 2: Unidirectional Perception Aggregation. Multi-view images are encoded using independent ResNet-18 backbones, while joint information is projected to the same dimensionality via MLPs. To establish an implicit bridge for perceptual collaboration, we introduce learnable latent tokens initialized randomly. In the unidirectional attention mechanism, a masking strategy ensures that each arm learns relevant inform…

Figure 3: Training Recipe. We employ a two-stage training paradigm to progressively cultivate collaborative perception and control. In the first stage, actions for each arm are decoded independently, enabling the model to acquire collaborative strategies at the perception level. In the second stage, the decoding processes for both arms are fused while the perception module is frozen, thereby preserving the pre-train…

Figure 4: Visualization in RoboTwin. As illustrated, compared with Diffusion Policy (DP), CUBic demonstrates superior performance in object localization and precise single-arm grasping, while maintaining strong coordination in bimanual manipulation scenarios. Metrics. For each task, we report the mean and standard deviation of the success rate over three random initial seeds. The result for each seed is evaluated ac…
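
Figure 3's two-stage recipe is the procedural part of the framework: learn perception with independent per-arm decoders first, then freeze the perception module and train the fused decoder on top of it. A minimal sketch of that freezing schedule follows; the optimizer, loss hooks, and module names are assumptions, not the paper's training code.

    # Hypothetical two-stage schedule mirroring the recipe described in Figure 3.
    # `model.perception`, `model.stage1_loss`, and `model.stage2_loss` are placeholder
    # attributes the user would define; nothing here is taken from the paper's code.
    import torch

    def train_two_stage(model, stage1_loader, stage2_loader, stage1_steps, stage2_steps):
        # Stage 1: perception and independent per-arm decoders train together.
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
        for _, batch in zip(range(stage1_steps), stage1_loader):
            loss = model.stage1_loss(batch)
            opt.zero_grad()
            loss.backward()
            opt.step()

        # Stage 2: freeze the perception module so its collaborative features are kept,
        # then optimize only the fused decoding path.
        for p in model.perception.parameters():
            p.requires_grad_(False)
        opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
        for _, batch in zip(range(stage2_steps), stage2_loader):
            loss = model.stage2_loss(batch)
            opt.zero_grad()
            loss.backward()
            opt.step()
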
Original abstract

Recent advances in visuomotor policy learning have enabled robots to perform control directly from visual inputs. Yet, extending such end-to-end learning from single-arm to bimanual manipulation remains challenging due to the need for both independent perception and coordinated interaction between arms. Existing methods typically favor one side -- either decoupling the two arms to avoid interference or enforcing strong cross-arm coupling for coordination -- thus lacking a unified treatment. We propose CUBic, a Coordinated and Unified framework for Bimanual perception and control that reformulates bimanual coordination as a unified perceptual modeling problem. CUBic learns a shared tokenized representation bridging perception and control, where independence and coordination emerge intrinsically from structure rather than from hand-crafted coupling. Our approach integrates three components: unidirectional perception aggregation, bidirectional perception coordination through two codebooks with shared mapping, and a unified perception-to-control diffusion policy. Extensive experiments on the RoboTwin benchmark show that CUBic consistently surpasses standard baselines, achieving marked improvements in coordination accuracy and task success rates over state-of-the-art visuomotor baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes CUBic, a Coordinated Unified Bimanual Perception and Control Framework that reformulates bimanual coordination as a unified perceptual modeling problem. It introduces a shared tokenized representation learned via unidirectional perception aggregation, bidirectional perception coordination through two codebooks with shared mapping, and a unified perception-to-control diffusion policy. The central claim is that independence and coordination emerge intrinsically from this structure without hand-crafted coupling mechanisms, with extensive experiments on the RoboTwin benchmark showing consistent outperformance over state-of-the-art visuomotor baselines in coordination accuracy and task success rates.

Significance. If the empirical results hold with proper validation, the work offers a potentially simplifying architectural alternative for bimanual visuomotor policies by avoiding explicit inter-arm coupling, which could advance end-to-end learning for dual-arm systems. The integration of tokenized representations and diffusion policies follows current trends but applies them in a unified bimanual setting.

major comments (2)
  1. [Abstract] The claim that CUBic 'consistently surpasses standard baselines' and achieves 'marked improvements in coordination accuracy and task success rates' on RoboTwin is supported by no quantitative results, error bars, ablation tables, baseline implementation details, or statistical tests. This leaves the central empirical claim unverifiable, even though it is load-bearing for the paper's contribution.
  2. [Methods] The description of bidirectional codebook coordination and shared mapping (implicit in the methods) lacks explicit equations defining the codebook training objective, the shared mapping function, or how unidirectional aggregation interacts with the diffusion policy. Without these, it is impossible to verify the claim that independence and coordination emerge intrinsically from structure rather than from fitted parameters or hand-crafted elements.
minor comments (1)
  1. [Abstract] Clarify the precise meaning of 'unidirectional perception aggregation' and 'bidirectional perception coordination' with a short definition or pointer to the relevant algorithmic description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive feedback on our manuscript. We address the major comments below and have revised the paper accordingly to improve clarity and verifiability.

Point-by-point responses
  1. Referee: [Abstract] The claim that CUBic 'consistently surpasses standard baselines' and achieves 'marked improvements in coordination accuracy and task success rates' on RoboTwin is supported by no quantitative results, error bars, ablation tables, baseline implementation details, or statistical tests. This leaves the central empirical claim unverifiable, even though it is load-bearing for the paper's contribution.

    Authors: We agree with the referee that the abstract would benefit from including quantitative highlights to support the claims. In the revised version, we have incorporated key performance metrics from the RoboTwin experiments, such as the coordination accuracy and task success rates with comparisons to baselines, along with pointers to the full tables, error bars, and statistical details in the main body of the paper. This addresses the verifiability concern while keeping the abstract concise. revision: yes

  2. Referee: [Methods] The description of bidirectional codebook coordination and shared mapping (implicit in the methods) lacks explicit equations defining the codebook training objective, the shared mapping function, or how unidirectional aggregation interacts with the diffusion policy. Without these, it is impossible to verify the claim that independence and coordination emerge intrinsically from structure rather than from fitted parameters or hand-crafted elements.

    Authors: We appreciate this observation and have added explicit equations in the revised methods section for the codebook training objective (including commitment and reconstruction losses), the shared mapping function between codebooks, and the interaction with unidirectional aggregation and the diffusion policy. These additions clarify how independence and coordination emerge intrinsically from the structure, as supported by the ablations. revision: yes
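
For orientation, the objective this response gestures at is usually written in the VQ-VAE family's form. A generic version (not the paper's exact equation, which is not reproduced in this review) combines an action-reconstruction term with codebook and commitment terms, where z_e is an encoder token, e its selected code, z_q the quantized token passed to the decoder with straight-through gradients, sg[.] the stop-gradient operator, and beta a commitment weight:

    \mathcal{L} \;=\;
    \underbrace{\lVert a - \hat{a}_\theta(z_q) \rVert_2^2}_{\text{reconstruction}}
    \;+\; \underbrace{\lVert \mathrm{sg}[z_e] - e \rVert_2^2}_{\text{codebook}}
    \;+\; \beta \, \underbrace{\lVert z_e - \mathrm{sg}[e] \rVert_2^2}_{\text{commitment}}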

Circularity Check

0 steps flagged

No circularity in architectural proposal

Full rationale

The paper introduces CUBic as a new framework with three components: unidirectional perception aggregation, bidirectional coordination via two codebooks with shared mapping, and a unified perception-to-control diffusion policy. No equations, derivations, or fitted parameters are presented in the abstract or description that would reduce the claimed coordination emergence or performance gains to quantities defined by construction within the paper. The method is framed as an architectural choice rather than a self-referential fit or prediction derived from its own inputs. Experimental claims on RoboTwin are external validation, not internal reductions. No self-citation chains or uniqueness theorems are invoked in the provided text to bear the central load.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the shared tokenized representation and two codebooks are presented as learned components rather than postulated new entities with independent evidence.

pith-pipeline@v0.9.0 · 5489 in / 1240 out tokens · 25331 ms · 2026-05-14T19:02:28.770787+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 12 canonical work pages · 6 internal anchors

  1. [1] Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, and Jun Zhu. H-rdt: Human manipulation enhanced bimanual robotic manipulation. arXiv preprint arXiv:2507.23523, 2025.
  2. [2] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visi...
  3. [3] Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. UniVLA: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025.
  4. [4] Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. Villa-x: Enhancing latent action modeling in vision-language-action models. arXiv preprint arXiv:2507.23682, 2025.
  5. [5] Yi Chen, Yuying Ge, Weiliang Tang, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, and Xihui Liu. Moto: Latent motion token as the bridging language for learning robot manipulation from videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19752–19763, 2025.
  6. [6] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research.
  7. [7] Rohan Chitnis, Shubham Tulsiani, Saurabh Gupta, and Abhinav Gupta. Efficient bimanual manipulation using learned task schemas. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 1149–1155. IEEE.
  8. [8] Zichen Cui, Hengkai Pan, Aadhithya Iyer, Siddhant Haldar, and Lerrel Pinto. Dynamo: In-domain dynamics pretraining for visuo-motor control. Advances in Neural Information Processing Systems, 37:33933–33961, 2024.
  9. [9] Haoyuan Deng, Wenkai Guo, Qianzhun Wang, Zhenyu Wu, and Ziwei Wang. Safebimanual: Diffusion-based trajectory optimization for safe bimanual manipulation, 2025.
  10. [10] Giovanni Franzese, Leandro de Souza Rosa, Tim Verburg, Luka Peternel, and Jens Kober. Interactive imitation learning of bimanual movement primitives. IEEE/ASME Transactions on Mechatronics, 2023.
  11. [11] Yankai Fu, Qiuxuan Feng, Ning Chen, Zichen Zhou, Mengzhen Liu, Mingdong Wu, Tianxing Chen, Shanyu Rong, Jiaming Liu, Hao Dong, and Shanghang Zhang. CordViP: Correspondence-based visuomotor policy for dexterous manipulation in real-world, 2025.
  12. [12] Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117.
  13. [13] Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, and Iman Soltani. Vita: Vision-to-action flow matching policy, 2026.
  14. [14] Jianfeng Gao, Xiaoshu Jin, Franziska Krebs, Noémie Jaquier, and Tamim Asfour. Bi-kvil: Keypoints-based visual imitation learning of bimanual manipulation tasks. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 16850–16857. IEEE, 2024.
  15. [15] Zeyu Gao, Yao Mu, Jinye Qu, Mengkang Hu, Shijia Peng, Chengkai Hou, Lingyue Guo, Ping Luo, Shanghang Zhang, and Yanfeng Lu. Dag-plan: Generating directed acyclic dependency graphs for dual-arm cooperative planning, 2025.
  16. [16] Nikolaos Gkanatsios, Jiahe Xu, Matthew Bronars, Arsalan Mousavian, Tsung-Wei Ke, and Katerina Fragkiadaki. 3d flowmatch actor: Unified 3d policy for single- and dual-arm manipulation, 2025.
  17. [17] Zhefei Gong, Pengxiang Ding, Shangke Lyu, Siteng Huang, Mingyang Sun, Wei Zhao, Zhaoxin Fan, and Donglin Wang. Carp: Visuomotor policy learning via coarse-to-fine autoregressive prediction, 2025.
  18. [18] Jennifer Grannen, Yilin Wu, Brandon Vu, and Dorsa Sadigh. Stabilize to act: Learning to coordinate for bimanual manipulation. In Conference on Robot Learning, pages 563–576. PMLR, 2023.
  19. [19] Markus Grotz, Mohit Shridhar, Yu-Wei Chao, Tamim Asfour, and Dieter Fox. Peract2: Benchmarking and learning for robotic bimanual manipulation tasks. In CoRL 2024 Workshop on Whole-body Control and Bimanual Manipulation: Applications in Humanoids and Beyond, 2024.
  20. [20] Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, Xiaodi Yuan, Pengwei Xie, Zhiao Huang, Rui Chen, and Hao Su. Maniskill2: A unified benchmark for generalizable manipulation skills. In International Conference on Learning Representations, 2023.
  21. [21] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239.
  22. [22] Iryna Hurova, Alinjar Dan, Karl Kruusamäe, and Arun Kumar Singh. Sampling-based optimization with parallelized physics simulator for bimanual manipulation, 2025.
  23. [23] Hokyun Im, Euijin Jeong, Andrey Kolobov, Jianlong Fu, and Youngwoon Lee. Twinvla: Data-efficient bimanual manipulation with twin single-arm vision-language-action models.
  24. [24] Yueru Jia, Jiaming Liu, Sixiang Chen, Chenyang Gu, Zhilve Wang, Longzan Luo, Xiaoqi Li, Pengwei Wang, Zhongyuan Wang, Renrui Zhang, and Shanghang Zhang. Lift3d policy: Lifting 2d foundation models for robust 3d robotic manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17347–17358, 2025.
  25. [25] Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 3d diffuser actor: Policy diffusion with 3d scene representations. arXiv, 2024.
  26. [26] Andrew Lee, Ian Chuang, Ling-Yuan Chen, and Iman Soltani. Interact: Inter-dependency aware action chunking with hierarchical attention transformers for bimanual manipulation, 2024.
  27. [27] Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11523–11532, 2022.
  28. [28] Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang, Huang Huang, Josiah Wong, Sujay Garlanka, Cem Gokmen, Ruohan Zhang, Weiyu Liu, Jiajun Wu, Roberto Martín-Martín, and Li Fei-Fei. Momagen: Generating demonstrations under soft and hard constraints for multi-step bimanual mobile manipulation, 2026.
  29. [29] Peiyan Li, Hongtao Wu, Yan Huang, Chilam Cheang, Liang Wang, and Tao Kong. Gr-mg: Leveraging partially annotated data via multi-modal goal-conditioned policy, 2024.
  30. [30] I-Chun Arthur Liu, Sicheng He, Daniel Seita, and Gaurav Sukhatme. Voxact-b: Voxel-based acting and stabilizing policy for bimanual manipulation. arXiv preprint arXiv:2407.04152, 2024.
  31. [31] Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1B: A diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024.
  32. [32] Guanxing Lu, Tengbo Yu, Haoyuan Deng, Season Si Chen, Yansong Tang, and Ziwei Wang. Anybimanual: Transferring unimanual policy for general bimanual manipulation. arXiv preprint arXiv:2412.06779, 2024.
  33. [33] Yiyang Lu, Yufeng Tian, Zhecheng Yuan, Xianbang Wang, Pu Hua, Zhengrong Xue, and Huazhe Xu. H3DP: Triply-hierarchical diffusion policy for visuomotor learning, 2025.
  34. [34] Qi Lv, Hao Li, Xiang Deng, Rui Shao, Yinchuan Li, Jianye Hao, Longxiang Gao, Michael Yu Wang, and Liqiang Nie. Spatial-temporal graph diffusion policy with kinematic modeling for bimanual robotic manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17394–17404, 2025.
  35. [35] Xiao Ma, Sumit Patidar, Iain Haughton, and Stephen James. Hierarchical diffusion policy for kinematics-aware multi-task robotic manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18081–18090, 2024.
  36. [36] Tomohiro Motoda, Ryo Hanai, Ryoichi Nakajo, Masaki Murooka, Floris Erich, and Yukiyasu Domae. Learning bimanual manipulation via action chunking and inter-arm coordination with transformers, 2025.
  37. [37] Yao Mu, Tianxing Chen, Zanxin Chen, Shijia Peng, Zhiqian Lan, Zeyu Gao, Zhixuan Liang, Qiaojun Yu, Yude Zou, Mingkun Xu, Lunkai Lin, Zhiqiang Xie, Mingyu Ding, and Ping Luo. Robotwin: Dual-arm robot benchmark with generative digital twins. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 27649–27660, 2025.
  38. [38] Sangjun Noh, Dongwoo Nam, Kangmin Kim, Geonhyup Lee, Yeonguk Yu, Raeyoung Kang, and Kyoobin Lee. 3d flow diffusion policy: Visuomotor policy learning via generating flow in 3d space, 2025.
  39. [39] Mingjie Pan, Jiyao Zhang, Tianshu Wu, Yinghao Zhao, Wenlong Gao, and Hao Dong. Omnimanip: Towards general robotic manipulation via object-centric interaction primitives as spatial constraints. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17359–17369, 2025.
  40. [40] William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022.
  41. [41] Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K Du, Zehuan Yuan, and Xinglong Wu. Tokenflow: Unified image tokenizer for multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2545–2555, 2025.
  42. [42] Dominik Schmidt and Minqi Jiang. Learning to act without actions. arXiv preprint arXiv:2312.10812, 2023.
  43. [43] Yan Shen, Ruihai Wu, Yubin Ke, Xinyuan Song, Zeyi Li, Xiaoqi Li, Hongwei Fan, Haoran Lu, and Hao Dong. Biassemble: Learning collaborative affordance for bimanual geometric assembly, 2025.
  44. [44] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2022.
  45. [45] Yue Su, Xinyu Zhan, Hongjie Fang, Han Xue, Hao-Shu Fang, Yong-Lu Li, Cewu Lu, and Lixin Yang. Dense policy: Bidirectional autoregressive learning of actions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14486–14495, 2025.
  46. [46] Yue Su, Chubin Zhang, Sijin Chen, Liufan Tan, Yansong Tang, Jianan Wang, and Xihui Liu. Dspv2: Improved dense policy for effective and generalizable whole-body mobile manipulation, 2025.
  47. [47] Hengkai Tan, Yao Feng, Xinyi Mao, Shuhe Huang, Guodong Liu, Zhongkai Hao, Hang Su, and Jun Zhu. Anypos: Automated task-agnostic actions for bimanual manipulation.
  48. [48] Yufeng Tian, Shuiqi Cheng, Tianming Wei, Tianxing Zhou, Yuanhang Zhang, Zixian Liu, Qianwei Han, Zhecheng Yuan, and Huazhe Xu. Vitas: Visual tactile soft fusion contrastive learning for visuomotor learning, 2026.
  49. [49] Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao-Shu Fang, and Tong He. Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11089–11099, 2025.
  50. [50] Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation, 2023.
  51. [51] Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, Xinghang Li, Bowen Yang, Zhe Li, Kai Zhu, Hongyu Wu, Yiheng Liu, Zhaoye Long, Yue Wang, Chong Liu, Dihan Wang, Ziqiang Ni, Xiang Yang, You Liu, Ruoxuan Feng, Runtian Xu, Lei Zhang, Denghang Huang, Chenghao Jin, Anlan Yin, Xinlong Wang, Zhenguo Sun, Junkai Zhao, Mengfei Du, Mingyu Cao, Xiansheng Chen, Ho... Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation, 2025.
  52. [52] Fan Xie, Alexander Chowdhury, M De Paolis Kaluza, Linfeng Zhao, Lawson Wong, and Rose Yu. Deep imitation learning for bimanual robotic manipulation. Advances in Neural Information Processing Systems, 33:2327–2337.
  53. [53] Huilin Xu, Jian Ding, Jiakun Xu, Ruixiang Wang, Jun Chen, Jinjie Mai, Yanwei Fu, Bernard Ghanem, Feng Xu, and Mohamed Elhoseiny. Diffusion-based imaginative coordination for bimanual manipulation, 2025.
  54. [54] Hang Xu, Yizhou Chen, Dongjie Yu, Yi Ren, and Jia Pan. Bikc+: Bimanual hierarchical imitation with keypose-conditioned coordination-aware consistency policies. IEEE Transactions on Automation Science and Engineering, 23:1064–1079, 2026.
  55. [55] Jingkai Xu and Xiangli Nie. Speci: Skill prompts based hierarchical continual imitation learning for robot manipulation. IEEE Transactions on Cognitive and Developmental Systems, 2025.
  56. [56] Yuyin Yang, Zetao Cai, Yang Tian, Jia Zeng, and Jiangmiao Pang. Gripper keypose and object pointflow as interfaces for bimanual robotic manipulation, 2025.
  57. [57] Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos. arXiv preprint arXiv:2410.11758, 2024.
  58. [58] Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. In Proceedings of Robotics: Science and Systems (RSS), 2024.
  59. [59] Yanjie Ze, Zixuan Chen, Wenhao Wang, Tianyi Chen, Xialin He, Ying Yuan, Xue Bin Peng, and Jiajun Wu. Generalizable humanoid manipulation with 3d diffusion policies. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2873–2880, 2025.
  60. [60] Qinglun Zhang, Zhen Liu, Haoqiang Fan, Guanghui Liu, Bing Zeng, and Shuaicheng Liu. Flowpolicy: Enabling fast and robust 3d flow-based policy via consistency flow matching for robot manipulation, 2024.
  61. [61] Huayi Zhou, Ruixiang Wang, Yunxin Tai, Yueci Deng, Guiliang Liu, and Kui Jia. You only teach once: Learn one-shot bimanual robotic manipulation from video demonstrations. arXiv preprint arXiv:2501.14208, 2025.
  62. [62] Yifeng Zhu, Zhenyu Jiang, Peter Stone, and Yuke Zhu. Learning generalizable manipulation policies with object-centric 3d representations, 2023.