Co-VLA: Coordination-Aware Structured Action Modeling for Dual-Arm Vision-Language-Action Systems
Pith reviewed 2026-06-26 16:52 UTC · model grok-4.3
The pith
Co-VLA introduces explicit structural priors into vision-language-action models so that dual-arm robots can coordinate tightly coupled tasks through separated shared and residual action latents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Co-VLA replaces the monolithic action head of a vision-language backbone with a Structured Action Expert trained with a modular coordination-aware loss. The loss shapes a shared latent to encode task-level coordination intent and residual latents to encode per-arm execution adjustments. At deployment, a Latent-Aware Controller interprets the learned representations to modulate synchronization strength, execution asymmetry, smoothness, and safety constraints in real time. The controller operates at the joint-command level and remains compatible with standard control pipelines.
What carries the argument
Structured Action Expert (SAE) whose modular coordination-aware loss separates shared coordination latents from arm-specific residual latents, paired with the Latent-Aware Controller (LAC) that reads those latents to adjust real-time control parameters.
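The shared/residual split can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the source does not state how the latents are combined, so the additive composition, the two-dimensional latents, and the names `StructuredLatents`, `arm_latents`, `asymmetry`, and `sync_gain` are hypothetical and are not the paper's implementation.

```python
# Illustrative sketch only. The additive composition, the latent sizes, and
# all names below are assumptions, not details taken from the paper.
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass
class StructuredLatents:
    shared: Vector          # task-level coordination intent
    residual_left: Vector   # execution adjustment for the left arm
    residual_right: Vector  # execution adjustment for the right arm


def norm(v: Vector) -> float:
    """Euclidean norm of a vector."""
    return sum(x * x for x in v) ** 0.5


def arm_latents(z: StructuredLatents) -> Tuple[Vector, Vector]:
    """Compose per-arm latents as shared + residual (assumed additive)."""
    left = [s + r for s, r in zip(z.shared, z.residual_left)]
    right = [s + r for s, r in zip(z.shared, z.residual_right)]
    return left, right


def asymmetry(z: StructuredLatents) -> float:
    """Hypothetical signal: distance between the two arms' residuals."""
    return norm([l - r for l, r in zip(z.residual_left, z.residual_right)])


def sync_gain(z: StructuredLatents, max_gain: float = 1.0) -> float:
    """Hypothetical mapping: gain grows as the shared latent dominates."""
    s = norm(z.shared)
    r = norm(z.residual_left) + norm(z.residual_right)
    if s + r == 0.0:
        return 0.0
    return max_gain * s / (s + r)


z = StructuredLatents(shared=[1.0, 0.0],
                      residual_left=[0.0, 0.1],
                      residual_right=[0.0, -0.1])
left, right = arm_latents(z)
print(left, right, asymmetry(z), sync_gain(z))
```

In this toy setup both arms inherit the same shared component and differ only through their residuals, so a downstream controller could, in principle, read quantities such as `asymmetry` or `sync_gain` without access to raw observations. Whether the paper's controller uses signals of this kind is not stated in the source.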
If this is right
- Yields a 27% success-rate gain over monolithic baselines on tight-coordination tasks.
- More than doubles the success rate on out-of-distribution real-world scenarios, from 13% to 27%.
- Shortens task completion time by up to 25%.
- Operates at the joint-command level and integrates with existing control pipelines without force or impedance sensing.
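The out-of-distribution figures in the list above can be checked with a few lines of arithmetic. Only the 13% and 27% values come from the source; the variable names are illustrative. The source does not say whether the separate 27% gain on tight-coordination tasks is absolute or relative, so that figure is not recomputed here.

```python
# Reported OOD real-world success rates, in percent (from the abstract).
baseline_pct = 13
co_vla_pct = 27

# Absolute gain, in percentage points.
gain_points = co_vla_pct - baseline_pct

# Relative improvement, as a ratio of the two rates.
ratio = co_vla_pct / baseline_pct

print(f"absolute gain: {gain_points} percentage points")
print(f"ratio: {ratio:.2f}x")
```

The absolute gain is 14 percentage points and the ratio is about 2.08, which is what "more than doubles" refers to.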
Where Pith is reading between the lines
- The explicit separation of coordination intent from arm-specific residuals may simplify debugging and partial retraining when only one arm's behavior needs adjustment.
- Because the controller acts on already-learned latents rather than raw observations, the same representations could support transfer to related coordination problems such as multi-robot handover.
- The joint-command compatibility suggests the method could be retrofitted onto existing dual-arm platforms with only a change to the policy head and no alteration to low-level hardware controllers.
Load-bearing premise
The modular coordination-aware loss will shape shared and residual latents according to task-specific structures in a manner that generalizes to unseen real-world conditions without post-hoc tuning of the loss weights or latent dimensions.
What would settle it
Performance on new real-world dual-arm tasks remains at baseline levels when the loss weights and latent dimensions are held fixed at the values used in the reported training runs.
Original abstract
Vision-language-action (VLA) models show strong capabilities in single and dual-arm robotic manipulation. Prior works show coordinated bimanual behaviors can emerge from end-to-end learning, leveraging large vision-language backbones with continuous action prediction. However, as bimanual tasks become tightly coupled and execution constraints become critical, implicit coordination alone is insufficient to ensure reliable, interpretable, and stable behavior. In this work, we propose Co-VLA, a coordination-aware bimanual manipulation framework introducing explicit structural priors into VLA models. We instantiate our method on a state-of-the-art vision-language backbone by replacing its monolithic action head with a Structured Action Expert (SAE) designed for bimanual coordination. Specifically, we introduce explicit structure at the action generation level with a modular coordination-aware loss that shapes shared and residual latents according to task-specific structures. The shared latent encodes task-level coordination intent, while residual latents capture execution adjustments for each arm. At deployment, a Latent-Aware Controller (LAC) interprets the learned representations to modulate synchronization strength, execution asymmetry, smoothness, and safety constraints in real time. LAC operates at the joint-command level and remains compatible with standard control pipelines without requiring force or impedance control. Experiments across simulation and real-world benchmarks show Co-VLA significantly outperforms monolithic baselines, achieving a 27% success rate gain in tight-coordination tasks, more than doubling performance in OOD real-world scenarios (from 13% to 27%), and reducing task completion time by up to 25%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Co-VLA, a coordination-aware bimanual VLA framework that replaces the monolithic action head of a vision-language backbone with a Structured Action Expert (SAE) using a modular coordination-aware loss to shape shared (task-level coordination) and residual (per-arm execution) latents, plus a Latent-Aware Controller (LAC) that modulates synchronization, asymmetry, smoothness, and safety at the joint-command level. It reports empirical gains over monolithic baselines: 27% success-rate improvement in tight-coordination tasks, more than doubling OOD real-world success (13% to 27%), and up to 25% reduction in task completion time.
Significance. If the reported performance deltas prove robust, the explicit structural priors could improve reliability and interpretability for tightly coupled dual-arm tasks where implicit coordination from large VL backbones falls short, while remaining compatible with standard control pipelines. The work is an empirical architecture contribution rather than a parameter-free derivation or machine-checked proof.
major comments (1)
- [Abstract / Results] Abstract and results: quantitative claims (27% success gain, OOD doubling from 13% to 27%, 25% time reduction) are presented without accompanying information on baselines, number of trials, statistical significance, or exclusion criteria, preventing evaluation of the central performance claim.
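To illustrate why the missing trial counts matter, the sketch below computes Wilson score intervals for the 13% and 27% out-of-distribution success rates under two hypothetical sample sizes. The trial counts (15 and 300) are assumptions chosen only for illustration, since the source does not report n, and the helper names `wilson_interval` and `intervals_overlap` are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# HYPOTHETICAL trial counts: the source does not report the number of trials.
for trials in (15, 300):
    baseline = wilson_interval(round(0.13 * trials), trials)
    co_vla = wilson_interval(round(0.27 * trials), trials)
    print(trials, baseline, co_vla, intervals_overlap(baseline, co_vla))
```

Under these assumed counts the two intervals overlap at 15 trials per condition and separate at 300. Interval overlap is only a rough heuristic rather than a formal significance test, but it shows how the same pair of percentages can be weak or strong evidence depending on n.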
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment point-by-point below and will incorporate revisions to improve clarity.
- Referee: [Abstract / Results] Abstract and results: quantitative claims (27% success gain, OOD doubling from 13% to 27%, 25% time reduction) are presented without accompanying information on baselines, number of trials, statistical significance, or exclusion criteria, preventing evaluation of the central performance claim.
Authors: We agree that the abstract would benefit from additional context to support evaluation of the reported gains. In the revised manuscript, we will expand the abstract to briefly specify the baselines (monolithic VLA models without structured action modeling), note the number of trials conducted across simulation and real-world settings, and indicate that performance differences were assessed for statistical significance. The results section of the full paper already details trial counts, variance, statistical tests, and exclusion criteria for failed or invalid runs; we will add explicit forward references from the abstract and results summary to these details. This will make the central claims more transparent without altering the reported numbers.
Revision: yes
Circularity Check
No significant circularity
The manuscript describes an empirical architecture modification to VLA models via a Structured Action Expert (SAE) and Latent-Aware Controller (LAC), together with a modular coordination-aware loss that shapes shared and residual latents. No equations, derivations, or parameter-fitting procedures are exhibited anywhere in the provided text. Performance claims (27% success-rate gain, OOD doubling, 25% time reduction) are presented as outcomes of simulation and real-world benchmarks against monolithic baselines rather than quantities obtained by construction from fitted inputs or self-citations. The central argument therefore remains self-contained and externally falsifiable through the reported experiments.
Axiom & Free-Parameter Ledger
invented entities (2)
- Structured Action Expert (SAE): no independent evidence
- Latent-Aware Controller (LAC): no independent evidence