Co-VLA: Coordination-Aware Structured Action Modeling for Dual-Arm Vision-Language-Action Systems

Chao Zhang; Daehyun Ji; Dongwook Lee; Jaewook Yoo; Jiaqian Yu; Lu Xu; Mingbo Zhao; Weiming Li; Xiongfeng Peng; Yamin Mao

arxiv: 2606.20285 · v1 · pith:N3XN6UOLnew · submitted 2026-06-18 · 💻 cs.RO

Yandong Wang, Jiaqian Yu, Xiongfeng Peng, Lu Xu, Yamin Mao, Weiming Li, Jaewook Yoo, Dongwook Lee, Daehyun Ji, Mingbo Zhao, Chao Zhang

Pith reviewed 2026-06-26 16:52 UTC · model grok-4.3

classification 💻 cs.RO

keywords: dual-arm manipulation, vision-language-action, bimanual coordination, structured action modeling, coordination-aware loss, robotic manipulation, latent representations

The pith

Co-VLA introduces explicit structural priors into vision-language-action models so that dual-arm robots can coordinate tightly coupled tasks through separated shared and residual action latents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that monolithic end-to-end VLA models are insufficient for reliable bimanual coordination once tasks impose tight coupling and execution constraints, and that adding explicit structure at the action head remedies this. It does so by replacing the single action predictor with a Structured Action Expert whose modular coordination-aware loss forces a shared latent to carry task-level intent while residual latents carry arm-specific corrections. A Latent-Aware Controller then reads these latents at runtime to adjust synchronization, asymmetry, smoothness, and safety directly in the joint-command stream. If the claim holds, dual-arm systems could achieve stable, interpretable behavior on assembly, handover, and similar tasks without custom force or impedance controllers and with better out-of-distribution robustness.

Core claim

Co-VLA replaces the monolithic action head of a vision-language backbone with a Structured Action Expert trained with a modular coordination-aware loss. The loss shapes a shared latent to encode task-level coordination intent and residual latents to encode per-arm execution adjustments. A Latent-Aware Controller then interprets the learned representations to modulate synchronization strength, execution asymmetry, smoothness, and safety constraints at the joint-command level in real time, while remaining compatible with standard control pipelines.
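
As an illustration of the shared/residual split, the following toy sketch decodes each arm's action as a shared component plus an arm-specific residual and sums three loss terms. The paper's actual loss is not given in this summary, so every name here (`decode`, `coordination_aware_loss`, `w_sync`, `w_res`) and the specific terms (imitation error, relative-motion error, residual-magnitude penalty) are assumptions, chosen only to show how a modular loss might push task-level content into the shared latent.

```python
def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def decode(shared, residual):
    """Toy decoder (assumption): arm action = shared intent + arm residual."""
    return [s + r for s, r in zip(shared, residual)]


def coordination_aware_loss(shared, res_left, res_right,
                            target_left, target_right,
                            w_sync=1.0, w_res=0.1):
    """Hypothetical modular loss; returns each term and the weighted total."""
    pred_left = decode(shared, res_left)
    pred_right = decode(shared, res_right)

    # Term 1: per-arm imitation error against demonstrated actions.
    imitation = mse(pred_left, target_left) + mse(pred_right, target_right)

    # Term 2: error in the relative motion between the arms (coordination).
    rel_pred = [l - r for l, r in zip(pred_left, pred_right)]
    rel_target = [l - r for l, r in zip(target_left, target_right)]
    sync = mse(rel_pred, rel_target)

    # Term 3: keep residuals small so the shared latent carries the intent.
    zeros = [0.0] * len(shared)
    residual_size = mse(res_left, zeros) + mse(res_right, zeros)

    total = imitation + w_sync * sync + w_res * residual_size
    return {"imitation": imitation, "sync": sync,
            "residual": residual_size, "total": total}


terms = coordination_aware_loss(
    shared=[0.5, -0.2], res_left=[0.1, 0.0], res_right=[0.0, 0.0],
    target_left=[0.5, -0.2], target_right=[0.5, -0.2])
print(terms)
```

Returning the terms separately is what makes the loss "modular" in this sketch: individual terms can be reweighted or switched off per task without touching the others.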

What carries the argument

Structured Action Expert (SAE) whose modular coordination-aware loss separates shared coordination latents from arm-specific residual latents, paired with the Latent-Aware Controller (LAC) that reads those latents to adjust real-time control parameters.
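
One way to picture the runtime side is a gain-scheduled low-pass filter with a per-joint step clamp. The sketch below is not the paper's controller: the mapping from latent norms to a synchronization gain and an asymmetry score, the smoothing rule, and `max_step` are all assumptions made for illustration. It only shows that this kind of modulation can act on joint commands alone, with no force or impedance signal.

```python
import math


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def lac_gains(shared, res_left, res_right, eps=1e-8):
    """Hypothetical gains read off the latents.

    sync: share of total latent magnitude carried by the shared latent (0..1).
    asym: signed imbalance between the two residuals (-1..1).
    """
    s, l, r = norm(shared), norm(res_left), norm(res_right)
    sync = s / (s + l + r + eps)
    asym = (l - r) / (l + r + eps)
    return sync, asym


def lac_step(prev, raw, sync, max_step=0.05):
    """Smooth and clamp one arm's joint command.

    Assumption: tighter synchronization means heavier smoothing, and every
    joint may move at most max_step per control tick (a simple safety limit).
    """
    alpha = 1.0 - 0.5 * sync
    out = []
    for p, q in zip(prev, raw):
        step = alpha * (q - p)
        step = max(-max_step, min(max_step, step))
        out.append(p + step)
    return out


sync, asym = lac_gains(shared=[1.0, 0.0], res_left=[0.2, 0.0], res_right=[0.0, 0.1])
cmd_left = lac_step(prev=[0.00, 0.10], raw=[0.30, 0.12], sync=sync)
print(sync, asym, cmd_left)
```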

If this is right

Yields a 27% success-rate gain over monolithic baselines on tight-coordination tasks.
More than doubles success rate on out-of-distribution real-world scenarios, rising from 13% to 27%.
Shortens task completion time by up to 25%.
Operates at the joint-command level and integrates with existing control pipelines without force or impedance sensing.
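
The out-of-distribution figures can be read two ways, and the difference matters when comparing them with the other headline numbers. A quick check of the arithmetic follows; only the 13% and 27% values come from the source, and the variable names are illustrative.

```python
baseline, covla = 0.13, 0.27

absolute_gain_pp = (covla - baseline) * 100  # gain in percentage points
relative_factor = covla / baseline           # multiplicative improvement

print(f"{absolute_gain_pp:.0f} percentage points, {relative_factor:.2f}x")
```

The summary does not say whether the 27% gain on tight-coordination tasks is absolute (percentage points) or relative, so that number and the out-of-distribution result should not be compared directly.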

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The explicit separation of coordination intent from arm-specific residuals may simplify debugging and partial retraining when only one arm's behavior needs adjustment.
Because the controller acts on already-learned latents rather than raw observations, the same representations could support transfer to related coordination problems such as multi-robot handover.
The joint-command compatibility suggests the method could be retrofitted onto existing dual-arm platforms with only a change to the policy head and no alteration to low-level hardware controllers.

Load-bearing premise

The modular coordination-aware loss will shape shared and residual latents according to task-specific structures in a manner that generalizes to unseen real-world conditions without post-hoc tuning of the loss weights or latent dimensions.

What would settle it

The claim would be undermined if performance on new real-world dual-arm tasks remained at baseline levels with the loss weights and latent dimensions held fixed at the values used in the reported training runs.

Figures

Figures reproduced from arXiv: 2606.20285 by Chao Zhang, Daehyun Ji, Dongwook Lee, Jaewook Yoo, Jiaqian Yu, Lu Xu, Mingbo Zhao, Weiming Li, Xiongfeng Peng, Yamin Mao, Yandong Wang.

**Figure 2.** Structured Action Expert (SAE) architecture.

**Figure 3.** Visualization of sequential motion paradigm (left) and …

**Figure 5.** Examples of out-of-distribution conditions for real …

**Figure 7.** Effect of LAC on joint trajectories. LAC produces …

Original abstract

Vision-language-action (VLA) models show strong capabilities in single and dual-arm robotic manipulation. Prior works show coordinated bimanual behaviors can emerge from end-to-end learning, leveraging large vision-language backbones with continuous action prediction. However, as bimanual tasks become tightly coupled and execution constraints become critical, implicit coordination alone is insufficient to ensure reliable, interpretable, and stable behavior. In this work, we propose Co-VLA, a coordination-aware bimanual manipulation framework introducing explicit structural priors into VLA models. We instantiate our method on a state-of-the-art vision-language backbone by replacing its monolithic action head with a Structured Action Expert (SAE) designed for bimanual coordination. Specifically, we introduce explicit structure at the action generation level with a modular coordination-aware loss that shapes shared and residual latents according to task-specific structures. The shared latent encodes task-level coordination intent, while residual latents capture execution adjustments for each arm. At deployment, a Latent-Aware Controller (LAC) interprets the learned representations to modulate synchronization strength, execution asymmetry, smoothness, and safety constraints in real time. LAC operates at the joint-command level and remains compatible with standard control pipelines without requiring force or impedance control. Experiments across simulation and real-world benchmarks show Co-VLA significantly outperforms monolithic baselines, achieving a 27% success rate gain in tight-coordination tasks, more than doubling performance in OOD real-world scenarios (from 13% to 27%), and reducing task completion time by up to 25%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Co-VLA adds an explicit shared/residual latent split plus runtime LAC modulation to VLA action heads for bimanual tasks, and the reported gains on tight coordination look plausible but rest on the untested claim that the modular loss generalizes without extra tuning.

The core addition is replacing the usual monolithic action head with a Structured Action Expert that factors coordination intent into a shared latent and arm-specific residuals, then uses a Latent-Aware Controller at inference to adjust sync, asymmetry, and safety on the fly. That structural choice directly targets the cases where end-to-end VLA falls short on tightly coupled dual-arm work, which is a real practical gap.

The paper shows decent empirical movement: 27% success lift on tight tasks, doubling on out-of-distribution real-world runs, and up to 25% faster completion. Those numbers are the kind of delta that matters for deployment, and the method stays compatible with standard joint-level control without needing force feedback. The framing is straightforward and the motivation is clear.

The soft spot is still the generalization story. The coordination-aware loss is supposed to shape the latents in a task-specific way that carries over to unseen conditions, but the write-up does not appear to include ablations on loss weight sensitivity or latent dimension choices. If those hyperparameters need per-task retuning, the practical advantage shrinks. The abstract-level performance claims also leave open questions about baseline strength, trial counts, and whether the OOD split was truly held out. Without those details the 27% figure is hard to weigh.
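
The kind of sensitivity ablation the letter asks for could be organized as a small grid over the loss weight and latent size. This is a sketch only: the parameter names `w_coord` and `latent_dim`, the grid values, and the `evaluate` callback (which would train and test one configuration and return a success rate) are hypothetical and do not come from the paper.

```python
from itertools import product


def sensitivity_grid(loss_weights, latent_dims):
    """Enumerate every (loss weight, latent dimension) pair as a config dict."""
    return [{"w_coord": w, "latent_dim": d}
            for w, d in product(loss_weights, latent_dims)]


def run_sweep(configs, evaluate):
    """Evaluate each config and report the spread of success rates.

    evaluate(config) must return a success rate in [0, 1]; it is supplied by
    the caller because training and testing are outside this sketch.
    """
    results = [(cfg, evaluate(cfg)) for cfg in configs]
    rates = [rate for _, rate in results]
    return results, max(rates) - min(rates)


configs = sensitivity_grid(loss_weights=[0.1, 1.0, 10.0], latent_dims=[8, 16, 32])
print(len(configs))  # 3 weights x 3 dimensions = 9 configurations
```

A small spread across the grid would support the claim that the method works without per-task retuning; a large spread would point the other way.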

This is aimed at the bimanual robotics crowd already working with VLA backbones. A reader who cares about making coordination explicit rather than hoping it emerges will get something usable from the architecture description. The work is coherent on its own terms and engages the right prior literature, so it clears the bar for serious refereeing even if the experiments need tightening.

Referee Report

1 major / 0 minor

Summary. The paper proposes Co-VLA, a coordination-aware bimanual VLA framework that replaces the monolithic action head of a vision-language backbone with a Structured Action Expert (SAE) using a modular coordination-aware loss to shape shared (task-level coordination) and residual (per-arm execution) latents, plus a Latent-Aware Controller (LAC) that modulates synchronization, asymmetry, smoothness, and safety at the joint-command level. It reports empirical gains over monolithic baselines: 27% success-rate improvement in tight-coordination tasks, more than doubling OOD real-world success (13% to 27%), and up to 25% reduction in task completion time.

Significance. If the reported performance deltas prove robust, the explicit structural priors could improve reliability and interpretability for tightly coupled dual-arm tasks where implicit coordination from large VL backbones falls short, while remaining compatible with standard control pipelines. The work is an empirical architecture contribution rather than a parameter-free derivation or machine-checked proof.

major comments (1)

[Abstract / Results] Abstract and results: quantitative claims (27% success gain, OOD doubling from 13% to 27%, 25% time reduction) are presented without accompanying information on baselines, number of trials, statistical significance, or exclusion criteria, preventing evaluation of the central performance claim.
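
To see why trial counts matter here, consider 95% Wilson score intervals for success rates of 13% and 27% at several sample sizes. The trial counts below are hypothetical, since this summary does not report them; the sketch only illustrates how interval width depends on the number of trials.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    return center - half, center + half


def intervals_overlap(trials, rate_a=0.13, rate_b=0.27):
    """True if the two intervals overlap at this (hypothetical) trial count."""
    _, hi_a = wilson_interval(round(rate_a * trials), trials)
    lo_b, _ = wilson_interval(round(rate_b * trials), trials)
    return hi_a >= lo_b


for n in (15, 100, 1000):  # hypothetical trial counts per condition
    print(n, wilson_interval(round(0.13 * n), n),
          wilson_interval(round(0.27 * n), n), intervals_overlap(n))
```

At 15 trials per condition the two intervals overlap substantially, so the difference would need a formal significance test before it could be credited; at much larger counts the intervals separate.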

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point-by-point below and will incorporate revisions to improve clarity.

Referee: [Abstract / Results] Abstract and results: quantitative claims (27% success gain, OOD doubling from 13% to 27%, 25% time reduction) are presented without accompanying information on baselines, number of trials, statistical significance, or exclusion criteria, preventing evaluation of the central performance claim.

Authors: We agree that the abstract would benefit from additional context to support evaluation of the reported gains. In the revised manuscript, we will expand the abstract to briefly specify the baselines (monolithic VLA models without structured action modeling), note the number of trials conducted across simulation and real-world settings, and indicate that performance differences were assessed for statistical significance. The results section of the full paper already details trial counts, variance, statistical tests, and exclusion criteria for failed or invalid runs; we will add explicit forward references from the abstract and results summary to these details. This will make the central claims more transparent without altering the reported numbers.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The manuscript describes an empirical architecture modification to VLA models via a Structured Action Expert (SAE) and Latent-Aware Controller (LAC), together with a modular coordination-aware loss that shapes shared and residual latents. No equations, derivations, or parameter-fitting procedures are exhibited anywhere in the provided text. Performance claims (27% success-rate gain, OOD doubling, 25% time reduction) are presented as outcomes of simulation and real-world benchmarks against monolithic baselines rather than quantities obtained by construction from fitted inputs or self-citations. The central argument therefore remains self-contained and externally falsifiable through the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract introduces two new architectural modules (SAE, LAC) and a modular loss without citing independent prior evidence or external benchmarks for these exact components; no free parameters are named.

invented entities (2)

Structured Action Expert (SAE): no independent evidence
purpose: Replace monolithic action head to enforce shared coordination latent and per-arm residual latents
New module introduced in the paper; no independent evidence supplied in abstract.

Latent-Aware Controller (LAC): no independent evidence
purpose: Interpret learned latents to modulate synchronization, asymmetry, smoothness and safety at joint-command level
New runtime component introduced in the paper; no independent evidence supplied in abstract.

pith-pipeline@v0.9.1-grok · 5850 in / 1229 out tokens · 23573 ms · 2026-06-26T16:52:13.289947+00:00 · methodology
