
arxiv: 2606.20285 · v1 · pith:N3XN6UOLnew · submitted 2026-06-18 · 💻 cs.RO

Co-VLA: Coordination-Aware Structured Action Modeling for Dual-Arm Vision-Language-Action Systems

Pith reviewed 2026-06-26 16:52 UTC · model grok-4.3

classification 💻 cs.RO
keywords dual-arm manipulation · vision-language-action · bimanual coordination · structured action modeling · coordination-aware loss · robotic manipulation · latent representations

The pith

Co-VLA introduces explicit structural priors into vision-language-action models so that dual-arm robots can coordinate tightly coupled tasks through separated shared and residual action latents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that monolithic end-to-end VLA models are insufficient for reliable bimanual coordination once tasks impose tight coupling and execution constraints, and that adding explicit structure at the action head remedies this. It does so by replacing the single action predictor with a Structured Action Expert whose modular coordination-aware loss forces a shared latent to carry task-level intent while residual latents carry arm-specific corrections. A Latent-Aware Controller then reads these latents at runtime to adjust synchronization, asymmetry, smoothness, and safety directly in the joint-command stream. If the claim holds, dual-arm systems could achieve stable, interpretable behavior on assembly, handover, and similar tasks without custom force or impedance controllers and with better out-of-distribution robustness.

Core claim

Co-VLA replaces the monolithic action head of a vision-language backbone with a Structured Action Expert that applies a modular coordination-aware loss; the loss shapes a shared latent to encode task-level coordination intent and residual latents to encode per-arm execution adjustments, after which a Latent-Aware Controller interprets the learned representations to modulate synchronization strength, execution asymmetry, smoothness, and safety constraints at the joint-command level in real time while remaining compatible with standard control pipelines.
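
The summary gives no equations for the Structured Action Expert, so the following is only a minimal Python sketch of the shared-plus-residual idea under stated assumptions. The names `StructuredHead`, `coordination_loss`, and `residual_weight`, the linear decoders, the decoder shared by both arms, and the squared-norm penalty on the residual latents are all hypothetical illustrations, not the authors' architecture or loss.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Dense layer without bias: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


@dataclass
class StructuredHead:
    """Hypothetical action head: shared base command plus per-arm correction."""
    w_shared: Matrix    # decodes the shared latent into a base joint command
    w_residual: Matrix  # decodes one arm's residual latent into a correction

    def decode(self, z_shared: Vector, z_left: Vector,
               z_right: Vector) -> Tuple[Vector, Vector]:
        base = linear(self.w_shared, z_shared)
        left = [b + r for b, r in zip(base, linear(self.w_residual, z_left))]
        right = [b + r for b, r in zip(base, linear(self.w_residual, z_right))]
        return left, right


def coordination_loss(head: StructuredHead, z_shared: Vector, z_left: Vector,
                      z_right: Vector, target_left: Vector,
                      target_right: Vector,
                      residual_weight: float = 0.1) -> float:
    """Assumed loss: action fit plus a penalty that keeps residuals small,
    so the shared latent has to explain as much of both actions as it can."""
    left, right = head.decode(z_shared, z_left, z_right)
    fit = mse(left, target_left) + mse(right, target_right)
    residual_penalty = sum(v * v for v in z_left + z_right)
    return fit + residual_weight * residual_penalty


identity = [[1.0, 0.0], [0.0, 1.0]]
head = StructuredHead(w_shared=identity, w_residual=identity)
target = [1.0, 2.0]

# Shared latent alone reproduces both targets; residuals stay at zero.
loss_symmetric = coordination_loss(
    head, [1.0, 2.0], [0.0, 0.0], [0.0, 0.0], target, target)

# A non-zero left residual moves the left arm off target and is penalized.
loss_asymmetric = coordination_loss(
    head, [1.0, 2.0], [1.0, 0.0], [0.0, 0.0], target, target)
```

In this toy setup the symmetric case has zero loss, while the unnecessary left-arm residual raises both the fit term and the penalty. A task-specific variant would presumably change which terms are active per task, but the summary does not say how.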

What carries the argument

Structured Action Expert (SAE) whose modular coordination-aware loss separates shared coordination latents from arm-specific residual latents, paired with the Latent-Aware Controller (LAC) that reads those latents to adjust real-time control parameters.
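
How the Latent-Aware Controller maps latents to control parameters is not described in this summary. As a rough illustration only, the sketch below assumes that scalar `sync` and `smoothness` values in [0, 1] have already been derived from the latents by some unspecified mapping, and shows one plausible way such values could modulate joint commands. The function `modulate_commands`, the `max_step` bound, and the assumption that the two arms share joint ordering and sign conventions are all hypothetical. Execution asymmetry is left out.

```python
from typing import List, Tuple

Vector = List[float]


def clamp(value: float, limit: float) -> float:
    """Restrict value to the range [-limit, limit]."""
    return max(-limit, min(limit, value))


def modulate_commands(prev_left: Vector, prev_right: Vector,
                      raw_left: Vector, raw_right: Vector,
                      sync: float, smoothness: float,
                      max_step: float) -> Tuple[Vector, Vector]:
    """Adjust one tick of joint commands for both arms.

    sync       : 0 keeps the arms independent, 1 forces equal joint steps
    smoothness : 0 applies the raw step, values near 1 damp it strongly
    max_step   : largest allowed per-tick joint change (safety bound)
    """
    out_left, out_right = [], []
    for pl, pr, rl, rr in zip(prev_left, prev_right, raw_left, raw_right):
        # Smoothness: scale down the requested change from the last command.
        step_l = (1.0 - smoothness) * (rl - pl)
        step_r = (1.0 - smoothness) * (rr - pr)
        # Synchronization: pull both steps toward their mean.
        mean_step = 0.5 * (step_l + step_r)
        step_l = (1.0 - sync) * step_l + sync * mean_step
        step_r = (1.0 - sync) * step_r + sync * mean_step
        # Safety: bound the per-tick change before issuing the command.
        out_left.append(pl + clamp(step_l, max_step))
        out_right.append(pr + clamp(step_r, max_step))
    return out_left, out_right


# Full synchronization: both arms take the averaged step.
synced = modulate_commands([0.0], [0.0], [1.0], [0.0],
                           sync=1.0, smoothness=0.0, max_step=10.0)

# Same request with a tight safety bound on the step size.
bounded = modulate_commands([0.0], [0.0], [1.0], [0.0],
                            sync=1.0, smoothness=0.0, max_step=0.2)
```

Because the adjustment acts only on commanded joint positions, a scheme of this kind needs no force or impedance feedback, which matches the compatibility claim above.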

If this is right

  • Yields a 27% success-rate gain over monolithic baselines on tight-coordination tasks.
  • More than doubles success rate on out-of-distribution real-world scenarios, rising from 13% to 27%.
  • Shortens task completion time by up to 25%.
  • Operates at the joint-command level and integrates with existing control pipelines without force or impedance sensing.
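
A quick arithmetic check helps when reading these figures. The sketch below uses only the 13% and 27% out-of-distribution rates quoted above and separates the absolute gain in percentage points from the relative gain. The helper names are illustrative. This summary does not say whether the 27% gain on tight-coordination tasks is absolute or relative.

```python
def absolute_gain(baseline: float, improved: float) -> float:
    """Difference in success rate, as a fraction (0.14 means 14 points)."""
    return improved - baseline


def relative_gain(baseline: float, improved: float) -> float:
    """Improvement as a fraction of the baseline success rate."""
    return (improved - baseline) / baseline


baseline_rate = 0.13  # OOD real-world success rate of the baseline
improved_rate = 0.27  # OOD real-world success rate reported for Co-VLA

points = 100 * absolute_gain(baseline_rate, improved_rate)  # about 14 points
ratio = improved_rate / baseline_rate                       # about 2.08
relative = relative_gain(baseline_rate, improved_rate)      # about 1.08

print(f"absolute gain: {points:.0f} percentage points")
print(f"ratio: {ratio:.2f}x, relative gain: {100 * relative:.0f}%")
```

The ratio of roughly 2.08 is consistent with "more than doubles", while the absolute improvement is about 14 percentage points.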

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The explicit separation of coordination intent from arm-specific residuals may simplify debugging and partial retraining when only one arm's behavior needs adjustment.
  • Because the controller acts on already-learned latents rather than raw observations, the same representations could support transfer to related coordination problems such as multi-robot handover.
  • The joint-command compatibility suggests the method could be retrofitted onto existing dual-arm platforms with only a change to the policy head and no alteration to low-level hardware controllers.

Load-bearing premise

The modular coordination-aware loss will shape shared and residual latents according to task-specific structures in a manner that generalizes to unseen real-world conditions without post-hoc tuning of the loss weights or latent dimensions.

What would settle it

Performance on new real-world dual-arm tasks remains at baseline levels when the loss weights and latent dimensions are held fixed at the values used in the reported training runs.

Figures

Figures reproduced from arXiv: 2606.20285 by Chao Zhang, Daehyun Ji, Dongwook Lee, Jaewook Yoo, Jiaqian Yu, Lu Xu, Mingbo Zhao, Weiming Li, Xiongfeng Peng, Yamin Mao, Yandong Wang.

Figure 1: Overview of our dual-arm VLA system. We design a Structured Action Expert (SAE) that interprets the hidden …
Figure 2: Structured Action Expert (SAE) architecture.
Figure 3: Visualization of sequential motion paradigm (left) and …
Figure 5: Examples of out-of-distribution conditions for real …
Figure 7: Effect of LAC on joint trajectories. LAC produces …
Original abstract

Vision-language-action (VLA) models show strong capabilities in single and dual-arm robotic manipulation. Prior works show coordinated bimanual behaviors can emerge from end-to-end learning, leveraging large vision-language backbones with continuous action prediction. However, as bimanual tasks become tightly coupled and execution constraints become critical, implicit coordination alone is insufficient to ensure reliable, interpretable, and stable behavior. In this work, we propose Co-VLA, a coordination-aware bimanual manipulation framework introducing explicit structural priors into VLA models. We instantiate our method on a state-of-the-art vision-language backbone by replacing its monolithic action head with a Structured Action Expert (SAE) designed for bimanual coordination. Specifically, we introduce explicit structure at the action generation level with a modular coordination-aware loss that shapes shared and residual latents according to task-specific structures. The shared latent encodes task-level coordination intent, while residual latents capture execution adjustments for each arm. At deployment, a Latent-Aware Controller (LAC) interprets the learned representations to modulate synchronization strength, execution asymmetry, smoothness, and safety constraints in real time. LAC operates at the joint-command level and remains compatible with standard control pipelines without requiring force or impedance control. Experiments across simulation and real-world benchmarks show Co-VLA significantly outperforms monolithic baselines, achieving a 27% success rate gain in tight-coordination tasks, more than doubling performance in OOD real-world scenarios (from 13% to 27%), and reducing task completion time by up to 25%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes Co-VLA, a coordination-aware bimanual VLA framework that replaces the monolithic action head of a vision-language backbone with a Structured Action Expert (SAE) using a modular coordination-aware loss to shape shared (task-level coordination) and residual (per-arm execution) latents, plus a Latent-Aware Controller (LAC) that modulates synchronization, asymmetry, smoothness, and safety at the joint-command level. It reports empirical gains over monolithic baselines: 27% success-rate improvement in tight-coordination tasks, more than doubling OOD real-world success (13% to 27%), and up to 25% reduction in task completion time.

Significance. If the reported performance deltas prove robust, the explicit structural priors could improve reliability and interpretability for tightly coupled dual-arm tasks where implicit coordination from large VL backbones falls short, while remaining compatible with standard control pipelines. The work is an empirical architecture contribution rather than a parameter-free derivation or machine-checked proof.

major comments (1)
  1. [Abstract / Results] Abstract and results: quantitative claims (27% success gain, OOD doubling from 13% to 27%, 25% time reduction) are presented without accompanying information on baselines, number of trials, statistical significance, or exclusion criteria, preventing evaluation of the central performance claim.
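
To illustrate why trial counts matter for this comment, the sketch below computes 95% Wilson score intervals for two success rates. The counts 4/30 and 8/30 are hypothetical, chosen only because they round to the 13% and 27% quoted above. The paper's actual trial counts are not given in this summary.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2.0 * trials)) / denom
    half_width = z * math.sqrt(
        p * (1.0 - p) / trials + z2 / (4.0 * trials * trials)) / denom
    return center - half_width, center + half_width


# Hypothetical counts: 4/30 is about 13%, 8/30 is about 27%.
baseline_ci = wilson_interval(4, 30)
method_ci = wilson_interval(8, 30)

# True when the lower end of the method's interval falls inside the baseline's.
intervals_overlap = method_ci[0] <= baseline_ci[1]

print(f"baseline: {baseline_ci[0]:.3f} to {baseline_ci[1]:.3f}")
print(f"method:   {method_ci[0]:.3f} to {method_ci[1]:.3f}")
print(f"overlap:  {intervals_overlap}")
```

With only 30 hypothetical trials per condition the two intervals overlap, so the same headline percentages could be far less conclusive than they look. Overlap alone is not a significance test, but it shows why the number of trials and the test used need to be reported next to the rates.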

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point-by-point below and will incorporate revisions to improve clarity.

  1. Referee: [Abstract / Results] Abstract and results: quantitative claims (27% success gain, OOD doubling from 13% to 27%, 25% time reduction) are presented without accompanying information on baselines, number of trials, statistical significance, or exclusion criteria, preventing evaluation of the central performance claim.

    Authors: We agree that the abstract would benefit from additional context to support evaluation of the reported gains. In the revised manuscript, we will expand the abstract to briefly specify the baselines (monolithic VLA models without structured action modeling), note the number of trials conducted across simulation and real-world settings, and indicate that performance differences were assessed for statistical significance. The results section of the full paper already details trial counts, variance, statistical tests, and exclusion criteria for failed or invalid runs; we will add explicit forward references from the abstract and results summary to these details. This will make the central claims more transparent without altering the reported numbers.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity


The manuscript describes an empirical architecture modification to VLA models via a Structured Action Expert (SAE) and Latent-Aware Controller (LAC), together with a modular coordination-aware loss that shapes shared and residual latents. No equations, derivations, or parameter-fitting procedures are exhibited anywhere in the provided text. Performance claims (27% success-rate gain, OOD doubling, 25% time reduction) are presented as outcomes of simulation and real-world benchmarks against monolithic baselines rather than quantities obtained by construction from fitted inputs or self-citations. The central argument therefore remains self-contained and externally falsifiable through the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract introduces two new architectural modules (SAE, LAC) and a modular loss without citing independent prior evidence or external benchmarks for these exact components; no free parameters are named.

invented entities (2)
  • Structured Action Expert (SAE) · no independent evidence
    purpose: Replace monolithic action head to enforce shared coordination latent and per-arm residual latents
    New module introduced in the paper; no independent evidence supplied in abstract.
  • Latent-Aware Controller (LAC) · no independent evidence
    purpose: Interpret learned latents to modulate synchronization, asymmetry, smoothness, and safety at joint-command level
    New runtime component introduced in the paper; no independent evidence supplied in abstract.


