IDEA: Insensitive to Dynamics Mismatch via Effect Alignment for Sim-to-Real Transfer in Multi-Agent Control

Bin Cheng; Bin He; Chenlong Liu; Xinyan Chen; Zhipeng Wang; Zhuohui Zhang

arxiv: 2606.26575 · v1 · pith:56TFENGPnew · submitted 2026-06-25 · 💻 cs.RO · cs.AI

IDEA: Insensitive to Dynamics Mismatch via Effect Alignment for Sim-to-Real Transfer in Multi-Agent Control

Chenlong Liu , Zhuohui Zhang , Xinyan Chen , Zhipeng Wang , Bin Cheng , Bin He This is my paper

Pith reviewed 2026-06-26 05:33 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords sim-to-real transfermulti-agent controldynamics mismatcheffect alignmentsemantic actionsclosed-loop controlaction synchronizationrobot navigation

0 comments

The pith

Effect alignment via semantic actions makes multi-agent policies robust to dynamics mismatch in sim-to-real transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a sim-to-real transfer method for multi-agent control that aims to be insensitive to dynamics mismatch by aligning effects at a semantic level. It does this by combining random environmental structures with discrete semantic actions in a closed-loop control setup and adding an action synchronization mechanism. This approach is meant to avoid the sensitivity of low-level policies to simulation-reality gaps. If successful, it would allow more reliable deployment of learned multi-agent behaviors in real environments where exact dynamics matching is difficult or costly.

Core claim

The central claim is that elevating policy learning to a semantic abstraction level through random environmental structure, discrete semantic actions, and closed-loop control, combined with an action synchronization mechanism, renders the policy insensitive to dynamics mismatch, leading to improved training efficiency and higher success rates in real-world multi-agent navigation tasks.

What carries the argument

Effect alignment, achieved by combining random environmental structure with discrete semantic actions through closed-loop control, which lifts policy learning above low-level dynamics details; supplemented by action synchronization to handle timing mismatches between agents.

If this is right

Substantially improves training efficiency over mainstream transfer methods.
Achieves higher success rates in real-world scenarios for multi-agent navigation.
Enhances the robustness and deployment stability of multi-agent systems under dynamics mismatch.
The action synchronization mechanism mitigates inter-agent action timing mismatches for better temporal consistency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such semantic-level policies might require less precise modeling of individual robot dynamics in simulation.
This method could be tested on physical robot swarms to validate scalability beyond the four navigation tasks.
Extending the discrete semantic actions to include more complex behaviors might broaden applicability to other control problems.

Load-bearing premise

That raising policy learning to a semantic level with random structures, discrete actions, and closed-loop control will automatically make it insensitive to any dynamics differences between sim and real.

What would settle it

Running the same multi-agent navigation experiments but with intentionally altered real-robot dynamics parameters and observing whether success rates remain high or drop to levels seen in standard low-level transfer methods.

Figures

Figures reproduced from arXiv: 2606.26575 by Bin Cheng, Bin He, Chenlong Liu, Xinyan Chen, Zhipeng Wang, Zhuohui Zhang.

**Figure 1.** Figure 1: Overview of the proposed sim-to-real transfer framework and experimental setup. (a) The execution pipeline illustrating the zero-shot deployment [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: The architecture of IDEA. (a) The simulation training phase, utilizing parallel environments and discretized high-level semantic actions. (b) The [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 4.** Figure 4: Learning curves across four simulated navigation tasks. For fair [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Radar chart illustrating the zero-shot transfer success rates (%) of [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: An example of an actual scene’s occupied grid map, taken from [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

read the original abstract

Complex multi-agent control tasks remain challenging for traditional rule-based and model-based approaches, motivating the adoption of learning-based methods. However, learning-based methods often struggle with sim-to-real transfer because they rely on accurate dynamics modeling or system identification and learn policies in low-level control spaces that are highly sensitive to dynamics mismatch, making them costly and fragile in complex environments. To address this issue, we propose a sim-to-real method for multi-agent control, which is insensitive to dynamics mismatch via effect alignment. Our method combines random environmental structure with discrete semantic actions through closed-loop control, elevating policy learning to a semantic abstraction level. Additionally, we develop an action synchronization mechanism that mitigates inter-agent action timing mismatches, thereby enhancing the temporal consistency of the system. Experiments on four multi-agent navigation tasks demonstrate that our method substantially improves training efficiency over mainstream transfer methods and achieves higher success rates in real-world scenarios, thereby improving the robustness and deployment stability of multi-agent systems under dynamics mismatch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Semantic abstraction plus action sync is a plausible way to blunt dynamics mismatch in multi-agent sim-to-real, but the abstract supplies no evidence that the claimed mechanism actually drives the gains.

read the letter

The paper's main move is to lift multi-agent policy learning to a semantic level with discrete actions, random environmental structure, and closed-loop control, then add an action synchronization step to fix timing issues between agents. This is meant to make the policy insensitive to sim-to-real dynamics mismatch.

The framing of the problem is straightforward: low-level policies break when models are off, and multi-agent settings add timing problems. The synchronization mechanism is a practical addition that could matter for navigation tasks where agents need to coordinate.

The reported outcomes are higher training efficiency and better real-world success rates on four navigation tasks compared with mainstream transfer methods. If those numbers are solid, that would be useful for people trying to deploy multi-agent systems.

The soft spot is that nothing in the abstract shows how effect alignment is actually enforced or measured, or whether the low-level controller still depends on accurate models. There are no numbers, no baselines named, and no ablations or controlled mismatch tests, so it is impossible to tell whether the semantic layer is what produces the robustness or whether something else is doing the work. The stress-test concern lands: the central assumption is not isolated.

This is for robotics groups working on sim-to-real for multi-agent control. A reader who needs ideas for handling mismatch would find the high-level approach worth seeing in full. It deserves peer review so the experiments and implementation details can be checked.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes IDEA, a sim-to-real transfer method for multi-agent control that claims insensitivity to dynamics mismatch via effect alignment. The approach elevates policy learning to a semantic abstraction level by combining random environmental structure, discrete semantic actions, and closed-loop control, supplemented by an action synchronization mechanism to address inter-agent timing issues. Experiments on four multi-agent navigation tasks are reported to show substantial gains in training efficiency over mainstream transfer methods and higher real-world success rates.

Significance. If the claimed mechanism holds and the performance gains are reproducible, the work could contribute to more robust deployment of learning-based multi-agent systems in settings where accurate dynamics models are unavailable or costly to obtain.

major comments (2)

[Abstract] Abstract: The central claim that elevating policy learning to semantic abstraction via random environmental structure, discrete semantic actions, and closed-loop control renders the policy insensitive to dynamics mismatch lacks supporting evidence in the form of component ablations or controlled mismatch sweeps; the reported improvements on four navigation tasks do not isolate whether gains arise from the abstraction mechanism or from other elements such as the synchronization mechanism.
[Abstract] Abstract: No method details, experimental protocols, quantitative results, baselines, or error analysis are supplied, preventing evaluation of the asserted performance gains in training efficiency and real-world success rates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments. We address each major comment below, focusing on the abstract as noted.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that elevating policy learning to semantic abstraction via random environmental structure, discrete semantic actions, and closed-loop control renders the policy insensitive to dynamics mismatch lacks supporting evidence in the form of component ablations or controlled mismatch sweeps; the reported improvements on four navigation tasks do not isolate whether gains arise from the abstraction mechanism or from other elements such as the synchronization mechanism.

Authors: We agree that the abstract presents the claim at a high level. The full manuscript provides component ablations in Section 4.2 that isolate the semantic abstraction (random structure + discrete actions + closed-loop) from the synchronization mechanism, and Section 5.3 includes controlled dynamics mismatch sweeps across the four navigation tasks to demonstrate insensitivity. We will revise the abstract to reference these supporting results explicitly. revision: yes
Referee: [Abstract] Abstract: No method details, experimental protocols, quantitative results, baselines, or error analysis are supplied, preventing evaluation of the asserted performance gains in training efficiency and real-world success rates.

Authors: Abstracts are by design concise summaries and do not contain full method details, protocols, quantitative results, baselines, or error analysis; these appear in Sections 3 (method), 4 (experiments and ablations), and 5 (real-world results with baselines and error bars). The referee summary itself notes that experiments on four tasks are reported. No revision to the abstract is needed on this point. revision: no

Circularity Check

0 steps flagged

No circularity; abstract states proposal without equations or derivation chain

full rationale

The abstract presents the IDEA method as a direct proposal that combines random environmental structure, discrete semantic actions, and closed-loop control to achieve effect alignment and dynamics-mismatch insensitivity, plus an action synchronization mechanism. No equations, fitted parameters, self-citations, uniqueness theorems, or ansatzes are referenced that could reduce any claimed result to its inputs by construction. The central claim is asserted as the method's design rather than derived from prior results or data fits in a circular manner. This is the most common honest finding when no load-bearing derivation steps are visible.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no equations, parameters, or explicit assumptions; ledger is empty by necessity.

pith-pipeline@v0.9.1-grok · 5717 in / 1023 out tokens · 24817 ms · 2026-06-26T05:33:39.502790+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 11 canonical work pages · 4 internal anchors

[1]

A survey on uav control with multi-agent reinforcement learning,

C. C. Ekechi, T. Elfouly, A. Alouani, and T. Khattab, “A survey on uav control with multi-agent reinforcement learning,”Drones, vol. 9, no. 7, p. 484, 2025

2025
[2]

Graph-based multi-agent reinforcement learning for large-scale uavs swarm system control,

B. Zhao, M. Huo, Z. Li, Z. Yu, and N. Qi, “Graph-based multi-agent reinforcement learning for large-scale uavs swarm system control,” Aerospace Science and Technology, vol. 150, p. 109166, 2024

2024
[3]

A survey of sim-to-real methods in rl: Progress, prospects and challenges with foundation models,

L. Da, J. Turnau, T. P. Kutralingam, A. Velasquez, P. Shakarian, and H. Wei, “A survey of sim-to-real methods in rl: Progress, prospects and challenges with foundation models,”arXiv preprint arXiv:2502.13187, 2025

work page arXiv 2025
[4]

Solving Rubik's Cube with a Robot Hand

I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribaset al., “Solving rubik’s cube with a robot hand,”arXiv preprint arXiv:1910.07113, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910
[5]

Closing the sim-to-real loop: Adapting simula- tion randomization with real world experience,

Y . Chebotar, A. Handa, V . Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox, “Closing the sim-to-real loop: Adapting simula- tion randomization with real world experience,” in2019 international conference on robotics and automation (ICRA). IEEE, 2019, pp. 8973–8979

2019
[6]

A review of key technologies for friction nonlinearity in an electro-hydraulic servo system,

B. Gao, W. Shen, L. Zheng, W. Zhang, and H. Zhao, “A review of key technologies for friction nonlinearity in an electro-hydraulic servo system,”Machines, vol. 10, no. 7, p. 568, 2022

2022
[7]

Hysteresis identification of joint with harmonic drive transmission based on monte carlo method,

Q. Wang, H. Wu, H. Handroos, Y . Song, M. Li, J. Yin, and Y . Cheng, “Hysteresis identification of joint with harmonic drive transmission based on monte carlo method,”Mechatronics, vol. 99, p. 103166, 2024

2024
[8]

A survey of multi-agent deep reinforcement learning with communication,

C. Zhu, M. Dastani, and S. Wang, “A survey of multi-agent deep reinforcement learning with communication,”Autonomous Agents and Multi-Agent Systems, vol. 38, no. 1, p. 4, 2024

2024
[9]

Sample-efficient robust multi-agent reinforcement learning in the face of environmental uncertainty,

L. Shi, E. Mazumdar, Y . Chi, and A. Wierman, “Sample-efficient robust multi-agent reinforcement learning in the face of environmental uncertainty,”arXiv preprint arXiv:2404.18909, 2024

work page arXiv 2024
[10]

Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning

V . Makoviychuk, L. Wawrzyniak, Y . Guo, M. Lu, K. Storey, M. Mack- lin, D. Hoeller, N. Rudin, A. Allshire, A. Handaet al., “Isaac gym: High performance gpu-based physics simulation for robot learning,” arXiv preprint arXiv:2108.10470, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[11]

Learning agile and dynamic motor skills for legged robots,

J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V . Tsounis, V . Koltun, and M. Hutter, “Learning agile and dynamic motor skills for legged robots,”Science robotics, vol. 4, no. 26, p. eaau5872, 2019

2019
[12]

Champion-level drone racing using deep reinforce- ment learning,

E. Kaufmann, L. Bauersfeld, A. Loquercio, M. M ¨uller, V . Koltun, and D. Scaramuzza, “Champion-level drone racing using deep reinforce- ment learning,”Nature, vol. 620, no. 7976, pp. 982–987, 2023

2023
[13]

Training language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,”Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022

2022
[14]

Grandmaster level in starcraft ii using multi-agent reinforcement learning,

O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgievet al., “Grandmaster level in starcraft ii using multi-agent reinforcement learning,”nature, vol. 575, no. 7782, pp. 350–354, 2019

2019
[15]

Bridging training and execu- tion via dynamic directed graph-based communication in cooperative multi-agent systems,

Z. Zhang, B. He, B. Cheng, and G. Li, “Bridging training and execu- tion via dynamic directed graph-based communication in cooperative multi-agent systems,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 22, 2025, pp. 23 395–23 403

2025
[16]

Multi-agent deep reinforcement learn- ing: a survey,

S. Gronauer and K. Diepold, “Multi-agent deep reinforcement learn- ing: a survey,”Artificial Intelligence Review, vol. 55, no. 2, pp. 895– 943, 2022

2022
[17]

The surprising effectiveness of ppo in cooperative multi- agent games,

C. Yu, A. Velu, E. Vinitsky, J. Gao, Y . Wang, A. Bayen, and Y . WU, “The surprising effectiveness of ppo in cooperative multi- agent games,” inAdvances in Neural Information Processing Systems, vol. 35, 2022, pp. 24 611–24 624

2022
[18]

Multi-agent reinforcement learning as a rehearsal for decentralized planning,

L. Kraemer and B. Banerjee, “Multi-agent reinforcement learning as a rehearsal for decentralized planning,”Neurocomputing, vol. 190, pp. 82–94, 2016

2016
[19]

Primal: Pathfinding via reinforcement and imitation multi-agent learning,

G. Sartoretti, J. Kerr, Y . Shi, G. Wagner, T. S. Kumar, S. Koenig, and H. Choset, “Primal: Pathfinding via reinforcement and imitation multi-agent learning,”IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 2378–2385, 2019

2019
[20]

PRIMAL 2: Pathfinding via reinforcement and imitation multi-agent learning- lifelong,

M. Damani, Z. Luo, E. Wenzel, and G. Sartoretti, “PRIMAL 2: Pathfinding via reinforcement and imitation multi-agent learning- lifelong,”IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 2666–2673, 2021

2021
[21]

An autonomous cooperative navigation ap- proach for multiple unmanned ground vehicles in a variable commu- nication environment,

X. Lin and M. Huang, “An autonomous cooperative navigation ap- proach for multiple unmanned ground vehicles in a variable commu- nication environment,”Electronics, vol. 13, no. 15, p. 3028, 2024

2024
[22]

Learning agile soccer skills for a bipedal robot with deep reinforcement learning,

T. Haarnoja, B. Moran, G. Lever, S. H. Huang, D. Tirumala, J. Hump- lik, M. Wulfmeier, S. Tunyasuvunakool, N. Y . Siegel, R. Hafner et al., “Learning agile soccer skills for a bipedal robot with deep reinforcement learning,”Science Robotics, vol. 9, no. 89, p. eadi8022, 2024

2024
[23]

Data-efficient hi- erarchical reinforcement learning,

O. Nachum, S. S. Gu, H. Lee, and S. Levine, “Data-efficient hi- erarchical reinforcement learning,”Advances in neural information processing systems, vol. 31, 2018

2018
[24]

SLAP: Shortcut learning for abstract planning,

Y . I. Liu, B. Li, B. Eysenbach, and T. Silver, “SLAP: Shortcut learning for abstract planning,”arXiv preprint arXiv:2511.01107, 2025

work page arXiv 2025
[25]

SLAC: simulation-pretrained latent action space for whole-body real-world rl,

J. Hu, P. Stone, and R. Mart ´ın-Mart´ın, “SLAC: simulation-pretrained latent action space for whole-body real-world rl,”arXiv preprint arXiv:2506.04147, 2025

work page arXiv 2025
[26]

Challenges of real-world reinforcement learning: definitions, benchmarks and analysis,

G. Dulac-Arnold, N. Levine, D. J. Mankowitz, J. Li, C. Paduraru, S. Gowal, and T. Hester, “Challenges of real-world reinforcement learning: definitions, benchmarks and analysis,”Machine Learning, vol. 110, no. 9, pp. 2419–2468, 2021

2021
[27]

Domain randomization for transferring deep neural networks from simulation to the real world,

J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in2017 IEEE/RSJ international con- ference on intelligent robots and systems (IROS). IEEE, 2017, pp. 23–30

2017
[28]

Assessing transferability from simulation to reality for reinforcement learning,

F. Muratore, M. Gienger, and J. Peters, “Assessing transferability from simulation to reality for reinforcement learning,”IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 4, pp. 1172– 1183, 2019

2019
[29]

Sim-to- real transfer of robotic control with dynamics randomization,

X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to- real transfer of robotic control with dynamics randomization,” in2018 IEEE international conference on robotics and automation (ICRA). IEEE, 2018, pp. 3803–3810

2018
[30]

Active domain randomization,

B. Mehta, M. Diaz, F. Golemo, C. J. Pal, and L. Paull, “Active domain randomization,” inConference on Robot Learning. PMLR, 2020, pp. 1162–1176

2020
[31]

Learning to reinforcement learn

J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick, “Learning to reinforcement learn,”arXiv preprint arXiv:1611.05763, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[32]

In-hand object rotation via rapid motor adaptation,

H. Qi, A. Kumar, R. Calandra, Y . Ma, and J. Malik, “In-hand object rotation via rapid motor adaptation,” inConference on Robot Learning. PMLR, 2023, pp. 1722–1732

2023
[33]

RMA: Rapid Motor Adaptation for Legged Robots

A. Kumar, Z. Fu, D. Pathak, and J. Malik, “Rma: Rapid motor adaptation for legged robots,”arXiv preprint arXiv:2107.04034, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[34]

Hybrid internal model: Learning agile legged locomotion with simulated robot response,

J. Long, Z. Wang, Q. Li, J. Gao, L. Cao, and J. Pang, “Hybrid internal model: Learning agile legged locomotion with simulated robot response,”arXiv preprint arXiv:2312.11460, 2023

work page arXiv 2023
[35]

Learning to see physical properties with active sensing motor policies,

G. B. Margolis, X. Fu, Y . Ji, and P. Agrawal, “Learning to see physical properties with active sensing motor policies,”arXiv preprint arXiv:2311.01405, 2023

work page arXiv 2023
[36]

Asid: Active exploration for system identification in robotic manipu- lation,

M. Memmel, A. Wagenmaker, C. Zhu, P. Yin, D. Fox, and A. Gupta, “Asid: Active exploration for system identification in robotic manipu- lation,”arXiv preprint arXiv:2404.12308, 2024

work page arXiv 2024
[37]

The complexity of decentralized control of markov decision processes,

D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein, “The complexity of decentralized control of markov decision processes,” Mathematics of operations research, vol. 27, no. 4, pp. 819–840, 2002

2002
[38]

Peyr ´e and M

G. Peyr ´e and M. Cuturi,Computational optimal transport: With applications to data science. Now Foundations and Trends, 2019

2019
[39]

Lipschitz continuity in model- based reinforcement learning,

K. Asadi, D. Misra, and M. Littman, “Lipschitz continuity in model- based reinforcement learning,” inInternational conference on machine learning. PMLR, 2018, pp. 264–273

2018

[1] [1]

A survey on uav control with multi-agent reinforcement learning,

C. C. Ekechi, T. Elfouly, A. Alouani, and T. Khattab, “A survey on uav control with multi-agent reinforcement learning,”Drones, vol. 9, no. 7, p. 484, 2025

2025

[2] [2]

Graph-based multi-agent reinforcement learning for large-scale uavs swarm system control,

B. Zhao, M. Huo, Z. Li, Z. Yu, and N. Qi, “Graph-based multi-agent reinforcement learning for large-scale uavs swarm system control,” Aerospace Science and Technology, vol. 150, p. 109166, 2024

2024

[3] [3]

A survey of sim-to-real methods in rl: Progress, prospects and challenges with foundation models,

L. Da, J. Turnau, T. P. Kutralingam, A. Velasquez, P. Shakarian, and H. Wei, “A survey of sim-to-real methods in rl: Progress, prospects and challenges with foundation models,”arXiv preprint arXiv:2502.13187, 2025

work page arXiv 2025

[4] [4]

Solving Rubik's Cube with a Robot Hand

I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribaset al., “Solving rubik’s cube with a robot hand,”arXiv preprint arXiv:1910.07113, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910

[5] [5]

Closing the sim-to-real loop: Adapting simula- tion randomization with real world experience,

Y . Chebotar, A. Handa, V . Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox, “Closing the sim-to-real loop: Adapting simula- tion randomization with real world experience,” in2019 international conference on robotics and automation (ICRA). IEEE, 2019, pp. 8973–8979

2019

[6] [6]

A review of key technologies for friction nonlinearity in an electro-hydraulic servo system,

B. Gao, W. Shen, L. Zheng, W. Zhang, and H. Zhao, “A review of key technologies for friction nonlinearity in an electro-hydraulic servo system,”Machines, vol. 10, no. 7, p. 568, 2022

2022

[7] [7]

Hysteresis identification of joint with harmonic drive transmission based on monte carlo method,

Q. Wang, H. Wu, H. Handroos, Y . Song, M. Li, J. Yin, and Y . Cheng, “Hysteresis identification of joint with harmonic drive transmission based on monte carlo method,”Mechatronics, vol. 99, p. 103166, 2024

2024

[8] [8]

A survey of multi-agent deep reinforcement learning with communication,

C. Zhu, M. Dastani, and S. Wang, “A survey of multi-agent deep reinforcement learning with communication,”Autonomous Agents and Multi-Agent Systems, vol. 38, no. 1, p. 4, 2024

2024

[9] [9]

Sample-efficient robust multi-agent reinforcement learning in the face of environmental uncertainty,

L. Shi, E. Mazumdar, Y . Chi, and A. Wierman, “Sample-efficient robust multi-agent reinforcement learning in the face of environmental uncertainty,”arXiv preprint arXiv:2404.18909, 2024

work page arXiv 2024

[10] [10]

Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning

V . Makoviychuk, L. Wawrzyniak, Y . Guo, M. Lu, K. Storey, M. Mack- lin, D. Hoeller, N. Rudin, A. Allshire, A. Handaet al., “Isaac gym: High performance gpu-based physics simulation for robot learning,” arXiv preprint arXiv:2108.10470, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[11] [11]

Learning agile and dynamic motor skills for legged robots,

J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V . Tsounis, V . Koltun, and M. Hutter, “Learning agile and dynamic motor skills for legged robots,”Science robotics, vol. 4, no. 26, p. eaau5872, 2019

2019

[12] [12]

Champion-level drone racing using deep reinforce- ment learning,

E. Kaufmann, L. Bauersfeld, A. Loquercio, M. M ¨uller, V . Koltun, and D. Scaramuzza, “Champion-level drone racing using deep reinforce- ment learning,”Nature, vol. 620, no. 7976, pp. 982–987, 2023

2023

[13] [13]

Training language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,”Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022

2022

[14] [14]

Grandmaster level in starcraft ii using multi-agent reinforcement learning,

O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgievet al., “Grandmaster level in starcraft ii using multi-agent reinforcement learning,”nature, vol. 575, no. 7782, pp. 350–354, 2019

2019

[15] [15]

Bridging training and execu- tion via dynamic directed graph-based communication in cooperative multi-agent systems,

Z. Zhang, B. He, B. Cheng, and G. Li, “Bridging training and execu- tion via dynamic directed graph-based communication in cooperative multi-agent systems,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 22, 2025, pp. 23 395–23 403

2025

[16] [16]

Multi-agent deep reinforcement learn- ing: a survey,

S. Gronauer and K. Diepold, “Multi-agent deep reinforcement learn- ing: a survey,”Artificial Intelligence Review, vol. 55, no. 2, pp. 895– 943, 2022

2022

[17] [17]

The surprising effectiveness of ppo in cooperative multi- agent games,

C. Yu, A. Velu, E. Vinitsky, J. Gao, Y . Wang, A. Bayen, and Y . WU, “The surprising effectiveness of ppo in cooperative multi- agent games,” inAdvances in Neural Information Processing Systems, vol. 35, 2022, pp. 24 611–24 624

2022

[18] [18]

Multi-agent reinforcement learning as a rehearsal for decentralized planning,

L. Kraemer and B. Banerjee, “Multi-agent reinforcement learning as a rehearsal for decentralized planning,”Neurocomputing, vol. 190, pp. 82–94, 2016

2016

[19] [19]

Primal: Pathfinding via reinforcement and imitation multi-agent learning,

G. Sartoretti, J. Kerr, Y . Shi, G. Wagner, T. S. Kumar, S. Koenig, and H. Choset, “Primal: Pathfinding via reinforcement and imitation multi-agent learning,”IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 2378–2385, 2019

2019

[20] [20]

PRIMAL 2: Pathfinding via reinforcement and imitation multi-agent learning- lifelong,

M. Damani, Z. Luo, E. Wenzel, and G. Sartoretti, “PRIMAL 2: Pathfinding via reinforcement and imitation multi-agent learning- lifelong,”IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 2666–2673, 2021

2021

[21] [21]

An autonomous cooperative navigation ap- proach for multiple unmanned ground vehicles in a variable commu- nication environment,

X. Lin and M. Huang, “An autonomous cooperative navigation ap- proach for multiple unmanned ground vehicles in a variable commu- nication environment,”Electronics, vol. 13, no. 15, p. 3028, 2024

2024

[22] [22]

Learning agile soccer skills for a bipedal robot with deep reinforcement learning,

T. Haarnoja, B. Moran, G. Lever, S. H. Huang, D. Tirumala, J. Hump- lik, M. Wulfmeier, S. Tunyasuvunakool, N. Y . Siegel, R. Hafner et al., “Learning agile soccer skills for a bipedal robot with deep reinforcement learning,”Science Robotics, vol. 9, no. 89, p. eadi8022, 2024

2024

[23] [23]

Data-efficient hi- erarchical reinforcement learning,

O. Nachum, S. S. Gu, H. Lee, and S. Levine, “Data-efficient hi- erarchical reinforcement learning,”Advances in neural information processing systems, vol. 31, 2018

2018

[24] [24]

SLAP: Shortcut learning for abstract planning,

Y . I. Liu, B. Li, B. Eysenbach, and T. Silver, “SLAP: Shortcut learning for abstract planning,”arXiv preprint arXiv:2511.01107, 2025

work page arXiv 2025

[25] [25]

SLAC: simulation-pretrained latent action space for whole-body real-world rl,

J. Hu, P. Stone, and R. Mart ´ın-Mart´ın, “SLAC: simulation-pretrained latent action space for whole-body real-world rl,”arXiv preprint arXiv:2506.04147, 2025

work page arXiv 2025

[26] [26]

Challenges of real-world reinforcement learning: definitions, benchmarks and analysis,

G. Dulac-Arnold, N. Levine, D. J. Mankowitz, J. Li, C. Paduraru, S. Gowal, and T. Hester, “Challenges of real-world reinforcement learning: definitions, benchmarks and analysis,”Machine Learning, vol. 110, no. 9, pp. 2419–2468, 2021

2021

[27] [27]

Domain randomization for transferring deep neural networks from simulation to the real world,

J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in2017 IEEE/RSJ international con- ference on intelligent robots and systems (IROS). IEEE, 2017, pp. 23–30

2017

[28] [28]

Assessing transferability from simulation to reality for reinforcement learning,

F. Muratore, M. Gienger, and J. Peters, “Assessing transferability from simulation to reality for reinforcement learning,”IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 4, pp. 1172– 1183, 2019

2019

[29] [29]

Sim-to- real transfer of robotic control with dynamics randomization,

X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to- real transfer of robotic control with dynamics randomization,” in2018 IEEE international conference on robotics and automation (ICRA). IEEE, 2018, pp. 3803–3810

2018

[30] [30]

Active domain randomization,

B. Mehta, M. Diaz, F. Golemo, C. J. Pal, and L. Paull, “Active domain randomization,” inConference on Robot Learning. PMLR, 2020, pp. 1162–1176

2020

[31] [31]

Learning to reinforcement learn

J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick, “Learning to reinforcement learn,”arXiv preprint arXiv:1611.05763, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[32] [32]

In-hand object rotation via rapid motor adaptation,

H. Qi, A. Kumar, R. Calandra, Y . Ma, and J. Malik, “In-hand object rotation via rapid motor adaptation,” inConference on Robot Learning. PMLR, 2023, pp. 1722–1732

2023

[33] [33]

RMA: Rapid Motor Adaptation for Legged Robots

A. Kumar, Z. Fu, D. Pathak, and J. Malik, “Rma: Rapid motor adaptation for legged robots,”arXiv preprint arXiv:2107.04034, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[34] [34]

Hybrid internal model: Learning agile legged locomotion with simulated robot response,

J. Long, Z. Wang, Q. Li, J. Gao, L. Cao, and J. Pang, “Hybrid internal model: Learning agile legged locomotion with simulated robot response,”arXiv preprint arXiv:2312.11460, 2023

work page arXiv 2023

[35] [35]

Learning to see physical properties with active sensing motor policies,

G. B. Margolis, X. Fu, Y . Ji, and P. Agrawal, “Learning to see physical properties with active sensing motor policies,”arXiv preprint arXiv:2311.01405, 2023

work page arXiv 2023

[36] [36]

Asid: Active exploration for system identification in robotic manipu- lation,

M. Memmel, A. Wagenmaker, C. Zhu, P. Yin, D. Fox, and A. Gupta, “Asid: Active exploration for system identification in robotic manipu- lation,”arXiv preprint arXiv:2404.12308, 2024

work page arXiv 2024

[37] [37]

The complexity of decentralized control of markov decision processes,

D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein, “The complexity of decentralized control of markov decision processes,” Mathematics of operations research, vol. 27, no. 4, pp. 819–840, 2002

2002

[38] [38]

Peyr ´e and M

G. Peyr ´e and M. Cuturi,Computational optimal transport: With applications to data science. Now Foundations and Trends, 2019

2019

[39] [39]

Lipschitz continuity in model- based reinforcement learning,

K. Asadi, D. Misra, and M. Littman, “Lipschitz continuity in model- based reinforcement learning,” inInternational conference on machine learning. PMLR, 2018, pp. 264–273

2018