RAPTOR: A Foundation Policy for Quadrotor Control

Dario Albani; Giuseppe Loianno; Jonas Eschmann

arXiv: 2509.11481 · v2 · submitted 2025-09-15 · cs.RO · cs.AI · cs.LG

Jonas Eschmann, Dario Albani, Giuseppe Loianno

Pith reviewed 2026-05-18 17:23 UTC · model grok-4.3

keywords: quadrotor control, zero-shot adaptation, meta-imitation learning, recurrent policy, reinforcement learning, sim-to-real transfer, foundation policy, in-context learning

The pith

A tiny recurrent policy adapts zero-shot to many different quadrotors

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a single three-layer neural network with 2084 parameters can control quadrotors across wide hardware differences by learning to adapt from its own recent history. Training proceeds by sampling 1000 varied quadrotors in simulation, training a separate reinforcement learning teacher for each, and distilling all teachers into one student policy whose recurrence supports rapid in-context adjustment. This approach targets the brittleness of current robotic controllers that require system identification and retraining for even minor platform changes. If the claim holds, one policy could handle trajectory tracking, wind, and disturbances on new real quadrotors from 32 g to 2.4 kg with different motors, frames, propellers, and flight controllers.

Core claim

RAPTOR trains a foundation policy for quadrotor control by first creating 1000 specialized teacher policies through reinforcement learning on distinct simulated platforms, then distilling them into one recurrent student policy. The student uses its hidden-layer recurrence to adapt its behavior within milliseconds to unseen real quadrotors, achieving zero-shot transfer without online adaptation or system identification.
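The reason a single student can imitate many platform-specific teachers is that the right action depends on information only recoverable from history. The toy below is a sketch of that idea and not the paper's method: a scalar plant `x' = x + gain * u` stands in for a quadrotor, a closed-form controller stands in for each RL teacher, and a hand-written gain estimator stands in for the learned recurrence. All names and numeric ranges are assumptions made for illustration.

```python
import random

def teacher_action(x, gain):
    """Closed-form deadbeat controller for x' = x + gain * u.
    Stands in for an RL teacher that knows its own platform."""
    return -x / gain

def memoryless_student(x, prev_x, prev_u):
    """Ignores history and always assumes a nominal gain of 1.0."""
    return -x

def history_student(x, prev_x, prev_u):
    """Infers the gain from the last transition, then acts like the teacher."""
    if prev_u == 0.0 or x == prev_x:
        return -x  # no evidence yet: fall back to the nominal gain
    gain_estimate = (x - prev_x) / prev_u
    return -x / gain_estimate

def imitation_loss(student, gains, steps=5):
    """Mean squared gap between student and teacher actions,
    measured on the states the student itself visits."""
    total, count = 0.0, 0
    for gain in gains:
        x, prev_x, prev_u = 1.0, 1.0, 0.0
        for _ in range(steps):
            u = student(x, prev_x, prev_u)
            total += (u - teacher_action(x, gain)) ** 2
            count += 1
            prev_x, prev_u, x = x, u, x + gain * u
    return total / count

rng = random.Random(0)
gains = [rng.uniform(0.5, 1.9) for _ in range(1000)]  # 1000 sampled toy "platforms"
print("memoryless student loss:", imitation_loss(memoryless_student, gains))
print("history student loss   :", imitation_loss(history_student, gains))
```

In this toy the history-conditioned student only disagrees with its teacher on the first step, before any transition has been observed, which is the analogue of the short adaptation window the paper describes.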

What carries the argument

The recurrent hidden layer in the policy network that maintains internal state to support in-context learning from recent observations and actions.
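As a rough picture of what "maintains internal state" means, the sketch below steps a three-layer policy of the kind the paper describes: a dense input layer, a GRU layer with hidden size 16, a dense output layer, a zero initial state, and the previous action fed back as an input. The weights are random, biases are omitted for brevity, and the observation width of 18, the tanh activations, and the class name are assumptions; it shows the data flow only and is not the trained policy.

```python
import math
import random

def _matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyRecurrentPolicy:
    """Dense input layer -> GRU cell -> dense output layer (random weights)."""

    def __init__(self, obs_dim, act_dim, hidden=16, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.act_dim, self.hidden = act_dim, hidden
        self.w_in = mat(hidden, obs_dim + act_dim)  # previous action is part of the input
        # One input matrix and one recurrent matrix per GRU gate.
        self.gates = {name: (mat(hidden, hidden), mat(hidden, hidden))
                      for name in ("update", "reset", "candidate")}
        self.w_out = mat(act_dim, hidden)
        self.reset()

    def reset(self):
        """Zero the recurrent state and the fed-back action."""
        self.h = [0.0] * self.hidden
        self.prev_action = [0.0] * self.act_dim

    def _gate_inputs(self, name, x):
        w, u = self.gates[name]
        return _matvec(w, x), _matvec(u, self.h)

    def step(self, observation):
        x = [math.tanh(v) for v in _matvec(self.w_in, list(observation) + self.prev_action)]
        wz, uz = self._gate_inputs("update", x)
        wr, ur = self._gate_inputs("reset", x)
        wn, un = self._gate_inputs("candidate", x)
        z = [_sigmoid(a + b) for a, b in zip(wz, uz)]
        r = [_sigmoid(a + b) for a, b in zip(wr, ur)]
        n = [math.tanh(a + ri * b) for a, ri, b in zip(wn, r, un)]
        # Standard GRU update: blend the candidate state with the old state.
        self.h = [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, self.h)]
        self.prev_action = [math.tanh(v) for v in _matvec(self.w_out, self.h)]
        return list(self.prev_action)

policy = TinyRecurrentPolicy(obs_dim=18, act_dim=4)
observation = [0.1] * 18
print(policy.step(observation))
print(policy.step(observation))  # same observation, different action: the state changed
```

The second call returns a different action for the same observation because the hidden state and the fed-back action have changed, which is the only channel through which such a policy can adapt without weight updates.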

If this is right

The same policy performs trajectory tracking on all tested real platforms without fine-tuning.
It maintains control under wind disturbances and physical poking.
Performance holds for both indoor and outdoor flights.
Adaptation completes in milliseconds, supporting real-time use across hardware types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distillation approach could produce foundation policies for other variable hardware robots such as manipulators.
Recurrent state might substitute for separate adaptive controllers in many robotic tasks.
Expanding the simulation distribution to include more environmental factors would test broader generalization.

Load-bearing premise

The 1000 sampled quadrotors in simulation capture enough real-world variation in motor response, frame flexibility, propeller aerodynamics, and controller latency for the distilled policy to transfer directly.

What would settle it

A real quadrotor whose motor curves, frame stiffness, or latency fall outside the range represented in the 1000 simulated samples would cause the policy to lose stability or fail at trajectory tracking.
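Setting up such a test starts with comparing a candidate platform against the sampled ranges. The sketch below assumes a simple per-parameter interval check; the paper reports real thrust-to-weight ratios up to about 12 against at most 5 in training, and that upper bound is the only number here taken from it. The lower bounds, the mass range, the example platform, and the function name are placeholders.

```python
# Sampled training ranges as (low, high). Only the thrust-to-weight upper
# bound of 5 comes from the paper; every other number is a placeholder.
TRAINING_RANGES = {
    "thrust_to_weight": (1.5, 5.0),
    "mass_kg": (0.02, 5.0),
}

def out_of_distribution(platform, ranges=TRAINING_RANGES):
    """Return the names of parameters that fall outside the sampled ranges."""
    flagged = []
    for name, value in platform.items():
        low, high = ranges[name]
        if not low <= value <= high:
            flagged.append(name)
    return sorted(flagged)

# Hypothetical platform: light airframe with a very high thrust-to-weight ratio.
print(out_of_distribution({"thrust_to_weight": 12.0, "mass_kg": 0.032}))
# prints ['thrust_to_weight']
```

A per-parameter check like this ignores correlations between parameters, so a platform can pass every interval and still be an unusual combination.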

Figures

Figures reproduced from arXiv: 2509.11481 by Dario Albani, Giuseppe Loianno, Jonas Eschmann.

Original abstract

Humans are remarkably data-efficient when adapting to new unseen conditions, like driving a new car. In contrast, modern robotic control systems, like neural network policies trained using Reinforcement Learning (RL), are highly specialized for single environments. Because of this overfitting, they are known to break down even under small differences like the Simulation-to-Reality (Sim2Real) gap and require system identification and retraining for even minimal changes to the system. In this work, we present RAPTOR, a method for training a highly adaptive foundation policy for quadrotor control. Our method enables training a single, end-to-end neural-network policy to control a wide variety of quadrotors. We test 10 different real quadrotors from 32 g to 2.4 kg that also differ in motor type (brushed vs. brushless), frame type (soft vs. rigid), propeller type (2/3/4-blade), and flight controller (PX4/Betaflight/Crazyflie/M5StampFly). We find that a tiny, three-layer policy with only 2084 parameters is sufficient for zero-shot adaptation to a wide variety of platforms. The adaptation through in-context learning is made possible by using a recurrence in the hidden layer. The policy is trained through our proposed Meta-Imitation Learning algorithm, where we sample 1000 quadrotors and train a teacher policy for each of them using RL. Subsequently, the 1000 teachers are distilled into a single, adaptive student policy. We find that within milliseconds, the resulting foundation policy adapts zero-shot to unseen quadrotors. We extensively test the capabilities of the foundation policy under numerous conditions (trajectory tracking, indoor/outdoor, wind disturbance, poking, different propellers).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RAPTOR distills 1000 RL teachers into one 2084-parameter recurrent policy that claims zero-shot control on 10 real quadrotors spanning big hardware differences, but the quantitative backing stays thin.

The main point is that a tiny recurrent policy trained by meta-imitation from many simulated teachers can handle real quadrotors that differ in mass, motors, frames, props, and controllers without retraining or system ID. That combination of scale in teacher diversity and recurrence for quick adaptation is the concrete step forward here. It moves past single-platform RL by building the adaptation into the student through distillation rather than online updates. The real-platform tests across indoor/outdoor flights, wind, and physical poking give the claim some grounding that pure simulation papers often lack. The choice to keep the policy at three layers and under 2100 parameters is practical and shows the method does not require heavy compute at inference. Credit is due for shipping tests on actual hardware variation instead of stopping at sim results.

The soft spot is the missing numbers. The abstract and summary describe success but give no tracking errors, failure rates, baseline comparisons, or ablations on teacher count or recurrence depth, so the size of the improvement over simpler fine-tuning or domain randomization stays unclear. The sampling concern also lands: without reported ranges for thrust curves, stiffness, or delays in the 1000 teachers, it is hard to know whether the real platforms truly test extrapolation or just sit inside the simulated distribution. If the paper supplies those details and the metrics in the full text, the central claim strengthens; otherwise the generalization story rests on qualitative observation.

This is for control researchers and drone teams who want to cut down on per-platform retraining. Readers working on meta-learning or foundation policies for robotics will see value in the hardware breadth and the small policy size. It deserves a serious referee because the method and the real-robot scope are substantial enough to review, even if the current evidence needs tightening on metrics and sampling coverage. I would send it to review with requests for quantitative results and explicit parameter distributions.

Referee Report

2 major / 0 minor

Summary. The paper claims to introduce RAPTOR, a method for training a small recurrent neural network policy (three layers, 2084 parameters) for quadrotor control using meta-imitation learning. RL teacher policies are trained for each of 1000 sampled simulated quadrotors and distilled into a single adaptive student policy. This policy is said to enable zero-shot adaptation to 10 real quadrotors varying in mass from 32 g to 2.4 kg, motor types (brushed/brushless), frame types (soft/rigid), propeller types (2/3/4-blade), and flight controllers (PX4/Betaflight/Crazyflie/M5StampFly). The adaptation is attributed to the recurrent hidden state allowing in-context learning, and the policy is tested on trajectory tracking, indoor/outdoor, wind disturbance, poking, and different propellers.

Significance. If the results hold, the work would be significant for demonstrating that a compact foundation policy can achieve broad zero-shot generalization across diverse real-world quadrotor hardware without retraining or system identification. This could reduce the engineering effort for deploying control policies on new platforms. The small parameter count is a strength, and the use of recurrence for adaptation is an interesting approach. The extensive testing on multiple real platforms under varied conditions provides a good starting point for validation, though quantitative details are needed to fully assess impact.

major comments (2)

The abstract states successful zero-shot tests on 10 real platforms under varied conditions, but provides no quantitative metrics, baselines, error bars, or ablation results. This is load-bearing for the central claim of effective adaptation, as qualitative success alone does not substantiate the performance of the 2084-parameter policy.
The sampling procedure for the 1000 quadrotors lacks specification of the parameter ranges and variance for dynamics properties like motor thrust curves, frame flexibility, propeller aerodynamics, and flight-controller latency. This information is necessary to evaluate whether the real-world test platforms represent genuine extrapolation beyond the simulated distribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address each major comment point by point below, providing the strongest honest defense of the work while acknowledging where revisions are warranted to strengthen the presentation.

Referee: The abstract states successful zero-shot tests on 10 real platforms under varied conditions, but provides no quantitative metrics, baselines, error bars, or ablation results. This is load-bearing for the central claim of effective adaptation, as qualitative success alone does not substantiate the performance of the 2084-parameter policy.

Authors: We agree that quantitative support is essential for the central claims. The full manuscript reports quantitative results including position and attitude tracking RMSE, success rates over repeated trials, and comparisons against non-recurrent baselines and ablations removing the recurrent state. To directly address the concern, we will revise the abstract to include key quantitative highlights (e.g., typical tracking errors and adaptation timescales) and ensure error bars from multiple runs plus ablation tables are clearly presented and referenced in the main text.
Revision: yes

Referee: The sampling procedure for the 1000 quadrotors lacks specification of the parameter ranges and variance for dynamics properties like motor thrust curves, frame flexibility, propeller aerodynamics, and flight-controller latency. This information is necessary to evaluate whether the real-world test platforms represent genuine extrapolation beyond the simulated distribution.

Authors: We acknowledge that the submitted version does not explicitly tabulate the full sampling ranges and variances in the main text. The manuscript describes sampling 1000 quadrotors but leaves the precise distributions for thrust curves, frame stiffness, propeller aerodynamics, and latency implicit. We will add a dedicated paragraph and summary table in the methods section (and expand the appendix) that specifies the uniform and Gaussian ranges used for each property. This revision will allow readers to assess how the real platforms relate to the training distribution.
Revision: yes
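For readers unfamiliar with the metric named in the first response, position tracking RMSE is conventionally the root of the mean squared Euclidean distance between reference and flown positions. The sketch below assumes time-aligned lists of 3D points; the function name and the sample data are hypothetical and no values from the paper are reproduced.

```python
import math

def position_rmse(reference, actual):
    """Root-mean-square Euclidean position error over a trajectory.

    Both arguments are equally long sequences of (x, y, z) points,
    assumed to be sampled at the same time instants.
    """
    if not reference or len(reference) != len(actual):
        raise ValueError("trajectories must be non-empty and equally long")
    squared_errors = [
        sum((r - a) ** 2 for r, a in zip(ref_point, act_point))
        for ref_point, act_point in zip(reference, actual)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical two-sample trajectory: one perfect sample, one that is 5 m off.
reference = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
flown = [(0.0, 0.0, 1.0), (3.0, 4.0, 1.0)]
print(position_rmse(reference, flown))
```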

Circularity Check

0 steps flagged

No significant circularity in the meta-imitation learning pipeline

The paper describes an empirical training procedure in which 1000 simulated quadrotors are each assigned an independent RL teacher policy, after which the teachers are distilled into a single recurrent student policy whose hidden state enables in-context adaptation. This pipeline does not contain any self-definitional equations, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the reported zero-shot performance on real platforms back to the input sampling distribution by construction. The central result is an experimental outcome measured on ten distinct physical vehicles whose dynamics lie outside the training set, making the derivation self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method rests on the assumption that simulated quadrotor variations are representative of real hardware differences and that the recurrent hidden state suffices for in-context adaptation without explicit system identification.

free parameters (2)

number of sampled quadrotors
1000 quadrotors are sampled to generate the teacher policies; the exact sampling distribution and parameter ranges are not specified.
policy architecture size
The three-layer network is fixed at 2084 parameters; this size is a modeling choice that enables the reported adaptation.

axioms (1)

Domain assumption: Quadrotor dynamics in simulation are sufficiently accurate to produce teachers whose behavior transfers to real hardware via distillation.
The entire meta-training pipeline depends on this transfer assumption.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Meta-Imitation Learning algorithm... sample 1000 quadrotors and train a teacher policy for each... distill into a single adaptive student policy... recurrence in the hidden layer

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: emergent implicit system identification... thrust-to-weight ratio... linear probe... R² of 0.949

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Adaptive Outer-Loop Control of Quadrotors via Reinforcement Learning
cs.RO · 2026-05 · unverdicted · novelty 5.0

An RL-based outer-loop quadrotor controller augmented with an online Residual Dynamics Predictor for disturbance estimation and a data-efficient sim-to-real calibration bridge.

Autonomous UAV Pipeline Near-proximity Inspection via Disturbance-Aware Predictive Visual Servoing
cs.RO · 2026-04 · unverdicted · novelty 5.0

The ESKF-PRE-VMPC framework couples quadrotor dynamics with image-feature prediction and disturbance estimation to enable autonomous near-proximity pipeline inspection that outperforms baselines in straight, windy, an...

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 2 Pith papers · 10 internal anchors

[1]

Whatsize(number of parameters) does the recurrent neural network policy require to express this behavior? Can it run in hard real-time at high frequencies when deployed onsmall microcontrollers?

work page
[2]

Will the policyforgetthe system dynamics after a short time?

Whatcontext windowis feasible? Recurrent neural networks are notoriously hard to train for sequences longer than 100−200 steps. Will the policyforgetthe system dynamics after a short time?

work page
[3]

Does the policygeneralizeto unseen quadrotors that are 1) in-distribution and 2) out-of- distribution?

work page
[4]

How muchtimeis required from activating the policy until it has gathered enough information to stably control the quadrotor? Is this feasible mid-flight, or would the quadrotor crash before the policy has identified the system properly?

work page
[5]

Is there a trade-off between agility and adaptability? We tackle the question of feasibility 1) by devising a method to train such a foundation policy for quadrotor control, implementing it, and testing it on a range of real-world quadrotors. We tackle the size and inference speed question 2) by studying the scaling laws (13) in the student policy and by ...

work page 2084
[6]

A quantitatively wide range of parameters: •Weight: 31.9 g - 2.4 kg •Size: 65 mm - 500 mm •Thrust-to-weight:≈1.75 - 12

work page
[7]

This shows that our proposed RAPTOR framework actually produces a policy that not only generalizes to quadrotors that are in the training distribution (cf

A qualitatively diverse set of features: •Flight controller: PX4, Betaflight, Crazyflie, M5StampFly 10 •State estimator: EKF, Mahony, Madgwick •Motor type: brushed and brushless •Flexible frame •Mixing two- and three-blade propellers Many of these quantities are (far) out-of-distribution, like a thrust-to-weight ratio of 12 (≤5 in training), a flexible fr...

work page
[8]

Switching from TD3 to SAC because we observed slightly more robust training dynamics in SAC

work page
[9]

Training for longer to ensure convergence for all quadrotors

work page
[10]

Adjusting the reward function, adding a penalty for termination and for the action derivative. 20

work page
[11]

Removing the curriculum because we found that the changes to the reward function stabilize the training without the need for a curriculum

work page
[12]

Ground-truth motor RPM states. The teacher policies are never deployed in reality, so instead of feeding a proprioceptive action history to account for the unobservable motor states as in (7), the teachers can directly observe the ground-truth motor states. This also makes the actor-critic architecture symmetric. We do these modifications to trade off wal...

work page
[13]

Figure 7A), we design a Gated Recurrent Unit (GRU) (47)-based foundation policy architecture as displayed in Figure 7B

Due to the variable number of past steps (cf. Figure 7A), we design a Gated Recurrent Unit (GRU) (47)-based foundation policy architecture as displayed in Figure 7B. The relatively small hidden dimensionality of 16 is justified by the scaling experiments in Section 2.3. Due to the recurrence, the policy can theoretically ”access” all the previous observat...

work page
[14]

G. Li, X. Liu, G. Loianno, Human-Aware Physical Human–Robot Collaborative Transportation and Manipulation With Multiple Aerial Robots.IEEE Transactions on Robotics41, 762–781 (2025), doi:10.1109/TRO.2024.3502508

work page doi:10.1109/tro.2024.3502508 2025
[15]

A. Ollero,et al., The AEROARMS Project: Aerial Robots with Advanced Manipulation Ca- pabilities for Inspection and Maintenance.IEEE Robotics and Automation Magazine25(4), 12–23 (2018), doi:10.1109/MRA.2018.2852789

work page doi:10.1109/mra.2018.2852789 2018
[16]

M. Tranzatto,et al., CERBERUS in the DARPA Subterranean Challenge.Science Robotics 7(66), eabp9742 (2022), doi:10.1126/scirobotics.abp9742,https://www.science.org/ doi/abs/10.1126/scirobotics.abp9742

work page doi:10.1126/scirobotics.abp9742 2022
[17]

Y. Song, A. Romero, M. M¨ uller, V. Koltun, D. Scaramuzza, Reaching the limit in autonomous racing: Optimal control versus reinforcement learning.Science Robotics 8(82), eadg1462 (2023), doi:10.1126/scirobotics.adg1462,https://www.science.org/ doi/abs/10.1126/scirobotics.adg1462

work page doi:10.1126/scirobotics.adg1462 2023
[18]

Champion-level drone racing using deep reinforcement learning,

E. Kaufmann,et al., Champion-level drone racing using deep reinforcement learning.Nature 620(7976), 982–987 (2023), doi:10.1038/s41586-023-06419-4

work page doi:10.1038/s41586-023-06419-4 2023
[19]

Ferede, T

R. Ferede, T. Blaha, E. Lucassen, C. De Wagter, G. C. de Croon, One Net to Rule Them All: Domain Randomization in Quadcopter Racing Across Different Platforms.arXiv preprint arXiv:2504.21586(2025)

work page arXiv 2025
[20]

Eschmann, D

J. Eschmann, D. Albani, G. Loianno, Learning to Fly in Seconds.IEEE Robotics and Automa- tion Letters9(7), 6336–6343 (2024), doi:10.1109/LRA.2024.3396025

work page doi:10.1109/lra.2024.3396025 2024
[21]

X. B. Peng, M. Andrychowicz, W. Zaremba, P. Abbeel, Sim-to-Real Transfer of Robotic Control with Dynamics Randomization, inIEEE International Conference on Robotics and Automation (ICRA)(2018), pp. 3803–3810, doi:10.1109/ICRA.2018.8460528

work page doi:10.1109/icra.2018.8460528 2018
[22]

Loquercio,et al., Deep drone racing: From simulation to reality with domain randomization

A. Loquercio,et al., Deep drone racing: From simulation to reality with domain randomization. IEEE Transactions on Robotics36(1), 1–14 (2019)

work page 2019
[23]

Hanover,et al., Autonomous drone racing: A survey.IEEE Transactions on Robotics40, 3044–3067 (2024)

D. Hanover,et al., Autonomous drone racing: A survey.IEEE Transactions on Robotics40, 3044–3067 (2024)

work page 2024
[24]

Radford,et al., Learning Transferable Visual Models From Natural Language Supervision, inProceedings of the 38th International Conference on Machine Learning, M

A. Radford,et al., Learning Transferable Visual Models From Natural Language Supervision, inProceedings of the 38th International Conference on Machine Learning, M. Meila, T. Zhang, Eds. (PMLR), vol. 139 ofProceedings of Machine Learning Research(2021), pp. 8748–8763, https://proceedings.mlr.press/v139/radford21a.html

work page 2021
[25]

Brown,et al., Language models are few-shot learners.Advances in neural information processing systems33, 1877–1901 (2020)

T. Brown,et al., Language models are few-shot learners.Advances in neural information processing systems33, 1877–1901 (2020)

work page 1901
[26]

Scaling Laws for Neural Language Models

J. Kaplan,et al., Scaling laws for neural language models.arXiv preprint arXiv:2001.08361 (2020). 24

work page internal anchor Pith review Pith/arXiv arXiv 2001
[27]

Varadarajan, A

E. Kaufmann, L. Bauersfeld, D. Scaramuzza, A Benchmark Comparison of Learned Control Policies for Agile Quadrotor Flight, inInternational Conference on Robotics and Automation (ICRA)(2022), pp. 10504–10510, doi:10.1109/ICRA46639.2022.9811564

work page doi:10.1109/icra46639.2022.9811564 2022
[28]

Zhang, D

R. Zhang, D. Zhang, M. W. Mueller, Proxfly: Robust control for close proximity quadcopter flight via residual reinforcement learning.arXiv preprint arXiv:2409.13193(2024)

work page arXiv 2024
[29]

J. Heeg, Y. Song, D. Scaramuzza, Learning quadrotor control from visual features using differentiable simulation.arXiv preprint arXiv:2410.15979(2024)

work page arXiv 2024
[30]

J. Xing, I. Geles, Y. Song, E. Aljalbout, D. Scaramuzza, Multi-task reinforcement learning for quadrotors.IEEE Robotics and Automation Letters(2024)

work page 2024
[31]

Gronauer, M

S. Gronauer, M. Kissel, L. Sacchetto, M. Korte, K. Diepold, Using simulation optimization to improve zero-shot policy transfer of quadrotors, in2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)(IEEE) (2022), pp. 10170–10176

work page 2022
[32]

Ferede, G

R. Ferede, G. de Croon, C. De Wagter, D. Izzo, End-to-end neural network based optimal quadcopter control.Robotics and Autonomous Systems172, 104588 (2024)

work page 2024
[33]

Ferede, C

R. Ferede, C. De Wagter, D. Izzo, G. C. De Croon, End-to-end reinforcement learning for time- optimal quadcopter flight, in2024 IEEE International Conference on Robotics and Automation (ICRA)(IEEE) (2024), pp. 6172–6177

work page 2024
[34]

Balandi, P

L. Balandi, P. Robuffo Giordano, M. Tognon, Acceleration-Based Inner-Loop Control and MPC for Aerial Robots: Advantages and Drawbacks, inEuropean Robotics Forum(Springer) (2025), pp. 75–80

work page 2025
[35]

S. M. Hegre, W. Rehberg, M. Kulkarni, K. Alexis, A Neural Network Mode for PX4 on Embedded Flight Controllers.arXiv preprint arXiv:2505.00432(2025)

work page arXiv 2025
[36]

Zhang,et al., A Learning-Based Quadcopter Controller With Extreme Adaptation.IEEE Transactions on Robotics41, 3948–3964 (2025), doi:10.1109/TRO.2025.3577037

D. Zhang,et al., A Learning-Based Quadcopter Controller With Extreme Adaptation.IEEE Transactions on Robotics41, 3948–3964 (2025), doi:10.1109/TRO.2025.3577037

work page doi:10.1109/tro.2025.3577037 2025
[37]

Henderson,et al., Deep reinforcement learning that matters, inProceedings of the AAAI conference on artificial intelligence, vol

P. Henderson,et al., Deep reinforcement learning that matters, inProceedings of the AAAI conference on artificial intelligence, vol. 32 (2018)

work page 2018
[38]

Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, J. Ba, Scalable trust-region method for deep rein- forcement learning using kronecker-factored approximation.Advances in neural information processing systems30(2017)

work page 2017
[39]

H. P. Van Hasselt, A. Guez, M. Hessel, V. Mnih, D. Silver, Learning values across many orders of magnitude.Advances in neural information processing systems29(2016)

work page 2016
[40]

W. C. Lewis II, M. Moll, L. E. Kavraki, How much do unstated problem constraints limit deep robotic reinforcement learning?arXiv preprint arXiv:1909.09282(2019)

work page arXiv 1909
[41]

Let's Play Again: Variability of Deep Reinforcement Learning Agents in Atari Environments

K. Clary, E. Tosch, J. Foley, D. Jensen, Let’s play again: Variability of deep reinforcement learning agents in atari environments.arXiv preprint arXiv:1904.06312(2019). 25

work page internal anchor Pith review Pith/arXiv arXiv 1904
[42]

Agarwal, M

R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, M. Bellemare, Deep reinforcement learning at the edge of the statistical precipice.Advances in neural information processing systems34, 29304–29320 (2021)

work page 2021
[43]

T. Baca,et al., The MRS UA V system: Pushing the frontiers of reproducible research, real-world deployment, and education with autonomous unmanned aerial vehicles.Journal of Intelligent & Robotic Systems102(1), 26 (2021)

work page 2021
[44]

Dreher, T

J. Eschmann, D. Albani, G. Loianno, Data-Driven System Identification of Quadrotors Subject to Motor Delays, inIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)(2024), pp. 8095–8102, doi:10.1109/IROS58592.2024.10801441

work page doi:10.1109/iros58592.2024.10801441 2024
[45]

Understanding intermediate layers using linear classifier probes

G. Alain, Y. Bengio, Understanding intermediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644(2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[46]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

A. Dosovitskiy,et al., An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929(2020)

work page internal anchor Pith review Pith/arXiv arXiv 2010
[47]

DINOv2: Learning Robust Visual Features without Supervision

M. Oquab,et al., Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

Mahony, T

R. Mahony, T. Hamel, J.-M. Pflimlin, Nonlinear complementary filters on the special orthogonal group.IEEE Transactions on automatic control53(5), 1203–1218 (2008)

work page 2008
[49]

S. O. Madgwick,et al., An efficient orientation filter for inertial and inertial/magnetic sensor arrays (2010)

work page 2010
[50]

Materials and methods are available as supplementary material

work page
[51]

Kunapuli, J

P. Kunapuli, J. Welde, D. Jayaraman, V. Kumar, Leveling the Playing Field: Carefully Com- paring Classical and Learned Controllers for Quadrotor Trajectory Tracking, inProceedings of Robotics: Science and Systems(Los Angeles, United States of America) (2025)

work page 2025
[52]

S. Ross, B. Chaib-draa, J. Pineau, Bayes-adaptive pomdps.Advances in neural information processing systems20(2007)

work page 2007
[53]

Koller, N

D. Koller, N. Friedman,Probabilistic graphical models: principles and techniques(MIT press) (2009)

work page 2009
[54]

Pearl,Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference(Mor- gan Kaufmann Publishers Inc., San Francisco, CA, USA) (1988)

J. Pearl,Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference(Mor- gan Kaufmann Publishers Inc., San Francisco, CA, USA) (1988)

work page 1988
[55]

OpenAI,et al., Solving Rubik’s Cube with a Robot Hand (2019)

work page 2019
[56]

J. X. Wang,et al., Learning to reinforcement learn.arXiv preprint arXiv:1611.05763(2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[57]

RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning

Y. Duan,et al., RL2: Fast reinforcement learning via slow reinforcement learning.arXiv preprint arXiv:1611.02779(2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[58]

GPT-4 Technical Report

J. Achiam,et al., Gpt-4 technical report.arXiv preprint arXiv:2303.08774(2023). 26

work page internal anchor Pith review Pith/arXiv arXiv 2023
[59]

Belkin, D

M. Belkin, D. Hsu, S. Ma, S. Mandal, Reconciling modern machine-learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences116(32), 15849–15854 (2019)

work page 2019
[60]

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

K. Cho,et al., Learning phrase representations using RNN encoder-decoder for statistical machine translation.arXiv preprint arXiv:1406.1078(2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014
[61]

S. Ross, G. Gordon, D. Bagnell, A reduction of imitation learning and structured prediction to no-regret online learning, inProceedings of the fourteenth international conference on artificial intelligence and statistics(JMLR Workshop and Conference Proceedings) (2011), pp. 627–635

work page 2011
[62]

Moler, Matrix computation on distributed memory multiprocessors.Hypercube Multipro- cessors86(181-195), 31 (1986)

C. Moler, Matrix computation on distributed memory multiprocessors.Hypercube Multipro- cessors86(181-195), 31 (1986)

work page 1986
[64]

Supplementary Code and Data Repository, Github: rl-tools/raptor,https://github.com/ rl-tools/raptor

work page
[65]

Project Page, Static Website,https://raptor.rl.tools/

work page
[66]

Supplementary Video, YouTube,https://youtu.be/hVzdWRFTX3k

work page
[67]

S. Särkkä, A. Solin, J. Hartikainen, Spatiotemporal Learning via Infinite-Dimensional Bayesian Filtering and Smoothing: A Look at Gaussian Process Regression Through Kalman Filtering. IEEE Signal Processing Magazine 30(4), 51–61 (2013), doi:10.1109/MSP.2013.2246292
[68]

S. Särkkä, A. Solin, Applied stochastic differential equations, vol. 10 (Cambridge University Press) (2019)
[69]

C. Berner, et al., Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680 (2019).

Acknowledgments

We thank Professor Van Anh Ho and Quang Ngoc Pham for letting us test the foundation policy on the soft quadrotor. Funding: This work was supported in part by the National Science Foundation (NSF) CAREER program under Grant 2145277, ...
[70]

Figure 7B shows the architecture, which contains a dense input layer, a Gated Recurrent Unit (GRU) layer, and a dense output layer. The recurrent state is initialized to all zeros (value at step 0), and we feed back the previous output of the policy as an input. Due to the small hidden dimensions S5 the foundation policy only has 2084 parameters: 𝑃=𝑃 in...
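As a rough illustration of the structure described here (dense input layer, GRU layer, dense output layer, zero initial recurrent state, previous action fed back as input), the following pure-Python sketch may help. It is not the authors' implementation: the class name `TinyGRUPolicy`, the observation size `OBS_DIM`, the action size `ACT_DIM`, the tanh activations, the random weights, and the zero "previous action" at step 0 are all assumptions made for illustration. Only the hidden dimensionality of 16 is taken from the text, and the sketch does not attempt to reproduce the 2084-parameter count.

```python
import math
import random

HIDDEN = 16   # hidden dimensionality quoted in the text
OBS_DIM = 18  # assumption: size of the observation vector
ACT_DIM = 4   # assumption: one command per motor


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def rand_matrix(rows, cols, rng):
    """Small random weights; a trained policy would load learned values."""
    return [[rng.uniform(-0.3, 0.3) for _ in range(cols)] for _ in range(rows)]


class TinyGRUPolicy:
    """Hypothetical dense -> GRU -> dense policy with action feedback."""

    def __init__(self, obs_dim=OBS_DIM, act_dim=ACT_DIM, hidden=HIDDEN, seed=0):
        rng = random.Random(seed)
        self.hidden, self.act_dim = hidden, act_dim
        in_dim = obs_dim + act_dim  # observation plus fed-back previous action
        self.W_in = rand_matrix(hidden, in_dim, rng)
        self.b_in = [0.0] * hidden
        # One input matrix W, recurrent matrix U, and bias b per GRU gate:
        # r = reset gate, z = update gate, n = candidate state.
        self.W = {g: rand_matrix(hidden, hidden, rng) for g in "rzn"}
        self.U = {g: rand_matrix(hidden, hidden, rng) for g in "rzn"}
        self.b = {g: [0.0] * hidden for g in "rzn"}
        self.W_out = rand_matrix(act_dim, hidden, rng)
        self.b_out = [0.0] * act_dim
        self.reset()

    def reset(self):
        self.h = [0.0] * self.hidden             # recurrent state starts at zero
        self.prev_action = [0.0] * self.act_dim  # assumption: zero before step 0

    def step(self, obs):
        x = list(obs) + self.prev_action
        # Dense input layer (the tanh activation is an assumption).
        e = [math.tanh(v + b) for v, b in zip(matvec(self.W_in, x), self.b_in)]
        Wx = {g: matvec(self.W[g], e) for g in "rzn"}
        Uh = {g: matvec(self.U[g], self.h) for g in "rzn"}
        r = [sigmoid(a + c + b) for a, c, b in zip(Wx["r"], Uh["r"], self.b["r"])]
        z = [sigmoid(a + c + b) for a, c, b in zip(Wx["z"], Uh["z"], self.b["z"])]
        n = [math.tanh(a + ri * c + b)
             for a, ri, c, b in zip(Wx["n"], r, Uh["n"], self.b["n"])]
        # Standard GRU update: blend the candidate state with the old state.
        self.h = [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, self.h)]
        # Dense output layer, squashed to [-1, 1] (also an assumption).
        action = [math.tanh(v + b)
                  for v, b in zip(matvec(self.W_out, self.h), self.b_out)]
        self.prev_action = action
        return action


policy = TinyGRUPolicy()
for t in range(3):
    print(t, [round(a, 3) for a in policy.step([0.1] * OBS_DIM)])
```

Because the hidden state and the previous action both carry over between calls to `step`, the same observation can produce different actions at different times, which is the mechanism that lets such a policy condition on its recent history. Calling `reset()` restores the all-zero state used at step 0.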


[1] [1]

What size (number of parameters) does the recurrent neural network policy require to express this behavior? Can it run in hard real-time at high frequencies when deployed on small microcontrollers?

[2] [2]

What context window is feasible? Recurrent neural networks are notoriously hard to train for sequences longer than 100–200 steps. Will the policy forget the system dynamics after a short time?

[3] [3]

Does the policy generalize to unseen quadrotors that are 1) in-distribution and 2) out-of-distribution?

[4] [4]

How much time is required from activating the policy until it has gathered enough information to stably control the quadrotor? Is this feasible mid-flight, or would the quadrotor crash before the policy has identified the system properly?

[5] [5]

Is there a trade-off between agility and adaptability? We tackle the question of feasibility 1) by devising a method to train such a foundation policy for quadrotor control, implementing it, and testing it on a range of real-world quadrotors. We tackle the size and inference speed question 2) by studying the scaling laws (13) in the student policy and by ...


[6] [6]

A quantitatively wide range of parameters:
• Weight: 31.9 g - 2.4 kg
• Size: 65 mm - 500 mm
• Thrust-to-weight: ≈1.75 - 12

[7] [7]

This shows that our proposed RAPTOR framework actually produces a policy that not only generalizes to quadrotors that are in the training distribution (cf

A qualitatively diverse set of features:
• Flight controller: PX4, Betaflight, Crazyflie, M5StampFly
• State estimator: EKF, Mahony, Madgwick
• Motor type: brushed and brushless
• Flexible frame
• Mixing two- and three-blade propellers
Many of these quantities are (far) out-of-distribution, like a thrust-to-weight ratio of 12 (≤5 in training), a flexible fr...

[8] [8]

Switching from TD3 to SAC because we observed slightly more robust training dynamics in SAC


[9] [9]

Training for longer to ensure convergence for all quadrotors


[10] [10]

Adjusting the reward function, adding a penalty for termination and for the action derivative.

[11] [11]

Removing the curriculum because we found that the changes to the reward function stabilize the training without the need for a curriculum


[12] [12]

Ground-truth motor RPM states. The teacher policies are never deployed in reality, so instead of feeding a proprioceptive action history to account for the unobservable motor states as in (7), the teachers can directly observe the ground-truth motor states. This also makes the actor-critic architecture symmetric. We do these modifications to trade off wal...


[13] [13]

Due to the variable number of past steps (cf. Figure 7A), we design a Gated Recurrent Unit (GRU) (47)-based foundation policy architecture as displayed in Figure 7B. The relatively small hidden dimensionality of 16 is justified by the scaling experiments in Section 2.3. Due to the recurrence, the policy can theoretically "access" all the previous observat...

[14] [14]

G. Li, X. Liu, G. Loianno, Human-Aware Physical Human–Robot Collaborative Transportation and Manipulation With Multiple Aerial Robots. IEEE Transactions on Robotics 41, 762–781 (2025), doi:10.1109/TRO.2024.3502508

[15] [15]

A. Ollero, et al., The AEROARMS Project: Aerial Robots with Advanced Manipulation Capabilities for Inspection and Maintenance. IEEE Robotics and Automation Magazine 25(4), 12–23 (2018), doi:10.1109/MRA.2018.2852789

[16] [16]

M. Tranzatto, et al., CERBERUS in the DARPA Subterranean Challenge. Science Robotics 7(66), eabp9742 (2022), doi:10.1126/scirobotics.abp9742, https://www.science.org/doi/abs/10.1126/scirobotics.abp9742

[17] [17]

Y. Song, A. Romero, M. Müller, V. Koltun, D. Scaramuzza, Reaching the limit in autonomous racing: Optimal control versus reinforcement learning. Science Robotics 8(82), eadg1462 (2023), doi:10.1126/scirobotics.adg1462, https://www.science.org/doi/abs/10.1126/scirobotics.adg1462

[18] [18]

E. Kaufmann, et al., Champion-level drone racing using deep reinforcement learning. Nature 620(7976), 982–987 (2023), doi:10.1038/s41586-023-06419-4

[19] [19]

R. Ferede, T. Blaha, E. Lucassen, C. De Wagter, G. C. de Croon, One Net to Rule Them All: Domain Randomization in Quadcopter Racing Across Different Platforms. arXiv preprint arXiv:2504.21586 (2025)

[20] [20]

J. Eschmann, D. Albani, G. Loianno, Learning to Fly in Seconds. IEEE Robotics and Automation Letters 9(7), 6336–6343 (2024), doi:10.1109/LRA.2024.3396025

[21] [21]

X. B. Peng, M. Andrychowicz, W. Zaremba, P. Abbeel, Sim-to-Real Transfer of Robotic Control with Dynamics Randomization, in IEEE International Conference on Robotics and Automation (ICRA) (2018), pp. 3803–3810, doi:10.1109/ICRA.2018.8460528

[22] [22]

A. Loquercio, et al., Deep drone racing: From simulation to reality with domain randomization. IEEE Transactions on Robotics 36(1), 1–14 (2019)

[23] [23]

D. Hanover, et al., Autonomous drone racing: A survey. IEEE Transactions on Robotics 40, 3044–3067 (2024)

[24] [24]

A. Radford, et al., Learning Transferable Visual Models From Natural Language Supervision, in Proceedings of the 38th International Conference on Machine Learning, M. Meila, T. Zhang, Eds. (PMLR), vol. 139 of Proceedings of Machine Learning Research (2021), pp. 8748–8763, https://proceedings.mlr.press/v139/radford21a.html

[25] [25]

T. Brown, et al., Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)

[26] [26]

J. Kaplan, et al., Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)

[27] [27]

E. Kaufmann, L. Bauersfeld, D. Scaramuzza, A Benchmark Comparison of Learned Control Policies for Agile Quadrotor Flight, in International Conference on Robotics and Automation (ICRA) (2022), pp. 10504–10510, doi:10.1109/ICRA46639.2022.9811564

[28] [28]

R. Zhang, D. Zhang, M. W. Mueller, Proxfly: Robust control for close proximity quadcopter flight via residual reinforcement learning. arXiv preprint arXiv:2409.13193 (2024)

[29] [29]

J. Heeg, Y. Song, D. Scaramuzza, Learning quadrotor control from visual features using differentiable simulation. arXiv preprint arXiv:2410.15979 (2024)

[30] [30]

J. Xing, I. Geles, Y. Song, E. Aljalbout, D. Scaramuzza, Multi-task reinforcement learning for quadrotors. IEEE Robotics and Automation Letters (2024)

[31] [31]

S. Gronauer, M. Kissel, L. Sacchetto, M. Korte, K. Diepold, Using simulation optimization to improve zero-shot policy transfer of quadrotors, in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE) (2022), pp. 10170–10176

[32] [32]

R. Ferede, G. de Croon, C. De Wagter, D. Izzo, End-to-end neural network based optimal quadcopter control. Robotics and Autonomous Systems 172, 104588 (2024)

[33] [33]

R. Ferede, C. De Wagter, D. Izzo, G. C. De Croon, End-to-end reinforcement learning for time-optimal quadcopter flight, in 2024 IEEE International Conference on Robotics and Automation (ICRA) (IEEE) (2024), pp. 6172–6177

[34] [34]

L. Balandi, P. Robuffo Giordano, M. Tognon, Acceleration-Based Inner-Loop Control and MPC for Aerial Robots: Advantages and Drawbacks, in European Robotics Forum (Springer) (2025), pp. 75–80

[35] [35]

S. M. Hegre, W. Rehberg, M. Kulkarni, K. Alexis, A Neural Network Mode for PX4 on Embedded Flight Controllers. arXiv preprint arXiv:2505.00432 (2025)

[36] [36]

D. Zhang, et al., A Learning-Based Quadcopter Controller With Extreme Adaptation. IEEE Transactions on Robotics 41, 3948–3964 (2025), doi:10.1109/TRO.2025.3577037

[37] [37]

P. Henderson, et al., Deep reinforcement learning that matters, in Proceedings of the AAAI conference on artificial intelligence, vol. 32 (2018)

[38] [38]

Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, J. Ba, Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. Advances in neural information processing systems 30 (2017)

[39] [39]

H. P. Van Hasselt, A. Guez, M. Hessel, V. Mnih, D. Silver, Learning values across many orders of magnitude. Advances in neural information processing systems 29 (2016)

[40] [40]

W. C. Lewis II, M. Moll, L. E. Kavraki, How much do unstated problem constraints limit deep robotic reinforcement learning? arXiv preprint arXiv:1909.09282 (2019)

[41] [41]

K. Clary, E. Tosch, J. Foley, D. Jensen, Let’s play again: Variability of deep reinforcement learning agents in Atari environments. arXiv preprint arXiv:1904.06312 (2019)

[42] [42]

R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, M. Bellemare, Deep reinforcement learning at the edge of the statistical precipice. Advances in neural information processing systems 34, 29304–29320 (2021)

[43] [43]

T. Baca, et al., The MRS UAV system: Pushing the frontiers of reproducible research, real-world deployment, and education with autonomous unmanned aerial vehicles. Journal of Intelligent & Robotic Systems 102(1), 26 (2021)

[44] [44]

J. Eschmann, D. Albani, G. Loianno, Data-Driven System Identification of Quadrotors Subject to Motor Delays, in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2024), pp. 8095–8102, doi:10.1109/IROS58592.2024.10801441

[45] [45]

G. Alain, Y. Bengio, Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644 (2016)

[46] [46]

A. Dosovitskiy, et al., An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

[47] [47]

M. Oquab, et al., DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

[48] [48]

R. Mahony, T. Hamel, J.-M. Pflimlin, Nonlinear complementary filters on the special orthogonal group. IEEE Transactions on automatic control 53(5), 1203–1218 (2008)

[49] [49]

S. O. Madgwick, et al., An efficient orientation filter for inertial and inertial/magnetic sensor arrays (2010)

[50] [50]

Materials and methods are available as supplementary material


[51] [51]

P. Kunapuli, J. Welde, D. Jayaraman, V. Kumar, Leveling the Playing Field: Carefully Comparing Classical and Learned Controllers for Quadrotor Trajectory Tracking, in Proceedings of Robotics: Science and Systems (Los Angeles, United States of America) (2025)

[52] [52]

S. Ross, B. Chaib-draa, J. Pineau, Bayes-adaptive POMDPs. Advances in neural information processing systems 20 (2007)

[53] [53]

D. Koller, N. Friedman, Probabilistic graphical models: principles and techniques (MIT press) (2009)

[54] [54]

J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA) (1988)

[55] [55]

OpenAI, et al., Solving Rubik’s Cube with a Robot Hand (2019)

[56] [56]

J. X. Wang, et al., Learning to reinforcement learn. arXiv preprint arXiv:1611.05763 (2016)
