
arxiv: 2605.22748 · v1 · pith:SH33GOYYnew · submitted 2026-05-21 · 💻 cs.RO · cs.AI · cs.LG · cs.MA

Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning

Pith reviewed 2026-05-22 04:52 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG · cs.MA
keywords multi-agent reinforcement learning · quadrotor racing · drone racing · collision avoidance · self-play · human-robot interaction · simulation to real transfer

The pith

Multi-agent reinforcement learning trains quadrotor agents that outperform a champion-level human pilot in multi-player races, halve collision rates relative to single-agent baselines, and generalize zero-shot to safer interaction with human pilots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that treating other actors as noise in single-agent setups limits safety and coordination in shared physical spaces, and that multi-agent reinforcement learning supplies the missing structure for effective real-world interaction. In high-speed quadrotor racing, league-based training against variable numbers of artificial opponents produces agents that anticipate collisions, overtake, and manage downwash effects. The resulting agents surpass a champion-level human pilot at speeds above 22 m/s and collide 50 percent less often than state-of-the-art single-agent baselines. The training further allows the agents to interact more safely with real human pilots without additional adaptation. The authors conclude that rigorous multi-agent demands build the robustness needed for robotic systems to coexist with people and other machines.

Core claim

Agents trained via league-based self-play in multi-agent reinforcement learning navigate complex aerodynamic interactions and strategic maneuvers with a variable number of racers, outperforming a champion-level human pilot in multi-player quadrotor races at speeds exceeding 22 m/s while reducing collision rates by 50 percent compared to state-of-the-art single-agent baselines, and enabling zero-shot generalization to safer human interaction.

What carries the argument

League-based self-play in multi-agent reinforcement learning, in which policies compete against diverse artificial opponents to evolve anticipatory collision avoidance, overtaking, and handling of physical effects such as aerodynamic downwash.
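The opponent-sampling idea behind league-based self-play can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `League` class, the snapshot names, and the cap of seven opponents (chosen only because the figures mention races with up to eight players) are assumptions made for illustration. The sketch keeps a pool of frozen policy snapshots and draws a variable number of them as opponents for each training race.

```python
import random
from dataclasses import dataclass, field


@dataclass
class League:
    """Pool of frozen policy snapshots used as opponents (hypothetical structure)."""
    snapshots: list = field(default_factory=list)
    max_opponents: int = 7  # assumption: up to 8-player races, so at most 7 opponents

    def add_snapshot(self, policy_id: str) -> None:
        """Store an identifier for a frozen copy of the learner."""
        self.snapshots.append(policy_id)

    def sample_opponents(self, rng: random.Random) -> list:
        """Draw a variable number of opponents, with replacement, from the pool."""
        if not self.snapshots:
            return []  # solo race until the league has at least one member
        n = rng.randint(0, self.max_opponents)
        return [rng.choice(self.snapshots) for _ in range(n)]


rng = random.Random(0)
league = League()
for step in range(5):
    # In a real training loop the learner's current weights would be frozen here.
    league.add_snapshot(f"policy_v{step}")
    opponents = league.sample_opponents(rng)
    print(f"iteration {step}: racing against {len(opponents)} opponents {opponents}")
```

Sampling uniformly from all past snapshots is the simplest possible choice; a real league might weight opponents by skill or recency, which this sketch does not attempt.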

If this is right

  • Agents develop proactive collision avoidance and strategic overtaking through repeated interaction with variable opponents.
  • Policies successfully handle multi-agent physical effects including aerodynamic downwash.
  • Zero-shot generalization from artificial opponents to human pilots improves safety without human-specific retraining.
  • Multi-agent interaction provides a more effective safety foundation than isolated single-agent constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same league-training structure could be tested in other coordination-heavy settings such as multi-vehicle traffic or robot swarms.
  • Limits of the current simulation fidelity could be probed by introducing controlled mismatches in aerodynamic parameters and re-measuring transfer performance.
  • Safety might improve further by mixing a small number of human demonstrations into the self-play league.

Load-bearing premise

The simulation used for training accurately captures the complex aerodynamic interactions and physical dynamics that occur in real-world multi-quadrotor flight.

What would settle it

Real-world multi-player races pitting the trained agents against a champion human pilot at speeds above 22 m/s, with direct measurement of whether collision rates remain 50 percent lower than single-agent baselines.

Figures

Figures reproduced from arXiv: 2605.22748 by Davide Scaramuzza, Ismail Geles, Leonard Bauersfeld, Markus Wulfmeier.

Figure 1. A, Long-exposure photograph of real-world deployment with four agents competing simultaneously. B, Large-scale evaluation of over 64,000 simulated four-player races comparing average lap time against race completion rate. Real-world data points indicate median lap times from all four-player races for our policy and the expert human pilot. C, Crash rates from the large-scale evaluation in B, classified by c…
Figure 2. A, Average race completion through self-evaluation with identical policies from solo to 8-player races. Each data point represents four policies per method, each completing 64 races with varied starting positions. Shaded regions indicate the average of standard deviations per policy. B, Crash rates by collision type (gate, wall, opponent) from the races completed in A. C, Sample races illustrating behavior…
Figure 3. A, Learned value function visualized by varying the ego agent position in the (X, Z) plane while fixing opponent positions. Trajectories are from a real-world overtaking maneuver at the Split-S gate, where competitors pass through an upper gate followed by a lower gate. The ego agent (orange) overtakes opponent 2 (red) across the four time steps shown. Low-value regions near opponents reflect learned coll…
Original abstract

Autonomous systems have achieved superhuman performance in isolation or simulation, yet they remain brittle in shared, dynamic real-world spaces. This failure stems from the dominant single-agent paradigm for physical applications, where other actors are ignored or treated as environmental noise, preventing effective coordination. Here we show that multi-agent reinforcement learning provides the essential safety scaffolding required for real-world interaction. Using high-speed quadrotor racing as a high-stakes testbed, we train agents to navigate complex aerodynamic interactions and strategic maneuvering with a variable number of racers. Through league-based self-play, agents evolve sophisticated anticipatory behaviors, including proactive collision avoidance, overtaking, and handling multi-agent physical interactions, including aerodynamic downwash. Our agents outperform a champion-level human pilot in multi-player races at speeds exceeding 22 m/s, while simultaneously reducing collision rates by 50 % compared to state-of-the-art single-agent baselines. Crucially, training with diverse artificial agents enables zero-shot generalization to safer human interaction. These results suggest that the path to robust robotic co-existence lies not in isolated safety constraints, but in the rigorous demands of multi-agent interaction. Multimedia materials are available at: https://rpg.ifi.uzh.ch/marl

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a multi-agent reinforcement learning approach for high-speed quadrotor racing. Agents are trained via league-based self-play to navigate aerodynamic interactions, collisions, and strategic maneuvers with variable numbers of opponents. The central claims are that the resulting policies outperform a champion-level human pilot at speeds exceeding 22 m/s, achieve a 50% reduction in collision rates relative to state-of-the-art single-agent baselines, and enable zero-shot generalization to safer interactions with human pilots.

Significance. If the performance and generalization results are substantiated with additional validation, the work would demonstrate that multi-agent self-play can produce anticipatory, safe behaviors that transfer beyond simulation more effectively than single-agent training. The direct quantitative comparison to human performance and the emphasis on emergent coordination from diverse opponents constitute a concrete advance for physical multi-robot systems. The absence of reported sim-to-real metrics, however, leaves the transfer claims provisional.

major comments (2)
  1. [Abstract] The headline claim of a 50% collision-rate reduction versus single-agent baselines is load-bearing for the safety argument, yet the manuscript provides no details on the number of evaluation trials, variance, or statistical tests used to establish this figure.
  2. [Abstract, Results] The zero-shot generalization to human pilots at >22 m/s rests on the unverified assumption that the simulated aerodynamic model (downwash, wake effects) matches physical reality at the relevant separations and speeds; no quantitative sim-to-real validation (force errors, velocity-field comparisons, or hardware collision statistics) is reported, which directly affects the central claim that multi-agent training supplies the necessary safety scaffolding.
minor comments (2)
  1. [Methods] The reward-function description would benefit from an explicit equation or table listing the weights on collision, downwash, and progress terms so that readers can assess how much of the reported safety is emergent versus directly shaped.
  2. [Results] Figure captions for the race trajectories and collision heat-maps should include the exact number of runs and the identity of the single-agent baselines used for comparison.
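The statistical reporting requested in the first major comment can be illustrated with a short Python sketch. It computes a relative reduction in crash rate and a two-sided two-proportion z-test. The z-test is a choice made here for simplicity, not a method the manuscript is known to use, and the crash counts are placeholders, not results from the paper.

```python
import math


def relative_reduction(baseline_rate: float, new_rate: float) -> float:
    """Fractional reduction of new_rate relative to baseline_rate."""
    return 1.0 - new_rate / baseline_rate


def two_proportion_z_test(crashes_a: int, n_a: int, crashes_b: int, n_b: int):
    """Two-sided z-test for a difference between two crash proportions."""
    p_a, p_b = crashes_a / n_a, crashes_b / n_b
    pooled = (crashes_a + crashes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution, written with erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Placeholder counts, NOT results from the paper.
single_agent = (200, 1000)  # (races with a crash, races run)
multi_agent = (100, 1000)

rate_sa = single_agent[0] / single_agent[1]
rate_ma = multi_agent[0] / multi_agent[1]
z, p = two_proportion_z_test(*single_agent, *multi_agent)
print(f"relative reduction: {relative_reduction(rate_sa, rate_ma):.2f}")
print(f"z = {z:.2f}, p = {p:.2e}")
```

Reporting the trial counts alongside the rates matters because the same 50% relative reduction can be highly significant or indistinguishable from noise, depending on how many races were run.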

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the presentation without overstating the current results.

  1. Referee: [Abstract] The headline claim of a 50% collision-rate reduction versus single-agent baselines is load-bearing for the safety argument, yet the manuscript provides no details on the number of evaluation trials, variance, or statistical tests used to establish this figure.

    Authors: We agree that statistical details are essential to support the collision-rate claim. The 50% figure derives from comparative evaluations in our simulation environment across randomized multi-agent racing scenarios. In the revised manuscript we will report the exact number of evaluation trials per condition, the observed means and variances (or standard deviations) in collision rates, and the results of appropriate statistical tests (e.g., t-test or Wilcoxon rank-sum) to establish significance. These additions will appear in the results section and be referenced concisely in the abstract.

    revision: yes

  2. Referee: [Abstract, Results] The zero-shot generalization to human pilots at >22 m/s rests on the unverified assumption that the simulated aerodynamic model (downwash, wake effects) matches physical reality at the relevant separations and speeds; no quantitative sim-to-real validation (force errors, velocity-field comparisons, or hardware collision statistics) is reported, which directly affects the central claim that multi-agent training supplies the necessary safety scaffolding.

    Authors: We thank the referee for underscoring the distinction between simulated and physical environments. The reported zero-shot generalization experiments place a human pilot in the loop inside the identical high-fidelity simulator used for training, so the aerodynamic interaction model (downwash, wake) remains consistent by construction. We acknowledge that the manuscript contains no quantitative sim-to-real metrics or hardware collision data. In revision we will explicitly state in the abstract and results that human-interaction trials occur in simulation, and we will add a limitations paragraph in the discussion that addresses the current absence of physical validation while outlining planned future hardware experiments. This clarifies the scope of the present claims.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity; results are empirical measurements against external baselines


The paper's core claims rest on training multi-agent RL policies via league self-play in simulation and then measuring lap times, collision rates, and human-interaction outcomes against independent external references (champion human pilot and single-agent baselines). No derivation step reduces a claimed prediction to a fitted parameter or self-citation by construction; the reward design, while including safety terms, does not tautologically produce the reported 50% collision reduction or zero-shot human generalization, which are evaluated post-training on held-out scenarios. The work is therefore self-contained against its stated benchmarks rather than circular.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the fidelity of the physics simulator and the design of the multi-agent reward function; no new physical entities are postulated.

free parameters (1)
  • multi-agent reward weights
    Coefficients balancing progress, collision penalties, and aerodynamic interaction terms must be chosen or tuned to produce the reported behaviors.
axioms (1)
  • domain assumption: The simulation environment faithfully reproduces real quadrotor aerodynamics including downwash effects.
    All training occurs in simulation; zero-shot transfer to hardware and humans is asserted without additional domain randomization details in the abstract.
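The "multi-agent reward weights" free parameter listed above can be made concrete with a small Python sketch of a weighted per-step reward. It combines the three kinds of terms the ledger names (progress, collision penalty, aerodynamic interaction). The coefficient values, the `RewardWeights` class, and the `downwash_exposure` input are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical coefficients; these are placeholders, not the paper's values."""
    progress: float = 1.0    # reward per metre of progress along the track
    collision: float = 10.0  # penalty applied on a step where a crash occurs
    downwash: float = 0.5    # penalty for exposure to an opponent's downwash


def step_reward(w: RewardWeights, progress_m: float, collided: bool,
                downwash_exposure: float) -> float:
    """Weighted sum of one progress term and two penalty terms."""
    return (w.progress * progress_m
            - w.collision * float(collided)
            - w.downwash * downwash_exposure)


weights = RewardWeights()
print(step_reward(weights, progress_m=0.4, collided=False, downwash_exposure=0.0))
print(step_reward(weights, progress_m=0.4, collided=True, downwash_exposure=0.2))
```

Writing the weights as an explicit structure makes the referee's question easier to pose: with `collision` and `downwash` set to zero, any remaining safety behavior would have to come from the multi-agent interaction rather than from direct reward shaping.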

pith-pipeline@v0.9.0 · 5759 in / 1391 out tokens · 50310 ms · 2026-05-22T04:52:28.739555+00:00 · methodology

