RAPTOR: A Foundation Policy for Quadrotor Control
Pith reviewed 2026-05-18 17:23 UTC · model grok-4.3
The pith
A tiny recurrent policy adapts zero-shot to many different quadrotors
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RAPTOR trains a foundation policy for quadrotor control by first creating 1000 specialized teacher policies through reinforcement learning on distinct simulated platforms, then distilling them into one recurrent student policy. The student uses its hidden-layer recurrence to adapt its behavior within milliseconds to unseen real quadrotors, achieving zero-shot transfer without online adaptation or system identification.
What carries the argument
The recurrent hidden layer in the policy network that maintains internal state to support in-context learning from recent observations and actions.
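As a rough illustration of how a recurrent hidden state can carry information from one control step to the next, the NumPy sketch below implements one standard GRU update. The hidden size of 16 matches the figure quoted from the paper further down this page; the random weights, the input width, and the names `init_params` and `gru_step` are assumptions for illustration, not the authors' implementation. In the paper's architecture a dense input layer precedes the GRU, so `x` here stands in for those input features.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(input_dim, hidden_dim, rng):
    """Small random weights for one GRU cell (illustrative values only)."""
    params = {}
    for gate in ("z", "r", "n"):  # update gate, reset gate, candidate state
        params[f"W_{gate}"] = rng.normal(0.0, 0.1, size=(hidden_dim, input_dim))
        params[f"U_{gate}"] = rng.normal(0.0, 0.1, size=(hidden_dim, hidden_dim))
        params[f"b_{gate}"] = np.zeros(hidden_dim)
    return params

def gru_step(params, h, x):
    """One GRU update: blend the previous state h with a candidate built from x."""
    z = sigmoid(params["W_z"] @ x + params["U_z"] @ h + params["b_z"])
    r = sigmoid(params["W_r"] @ x + params["U_r"] @ h + params["b_r"])
    n = np.tanh(params["W_n"] @ x + params["U_n"] @ (r * h) + params["b_n"])
    return (1.0 - z) * n + z * h

rng = np.random.default_rng(0)
params = init_params(input_dim=16, hidden_dim=16, rng=rng)

h = np.zeros(16)  # zero initial state, as described for the paper's policy
for _ in range(50):
    x = rng.normal(size=16)  # synthetic stand-in for per-step input features
    h = gru_step(params, h, x)
```

Because `h` is the only thing passed between steps, anything the policy infers about the vehicle from past observations and actions has to be encoded in those 16 numbers.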
If this is right
- The same policy performs trajectory tracking on all tested real platforms without fine-tuning.
- It maintains control under wind disturbances and physical poking.
- Performance holds for both indoor and outdoor flights.
- Adaptation completes in milliseconds, supporting real-time use across hardware types.
Where Pith is reading between the lines
- The same distillation approach could produce foundation policies for other robots whose hardware varies from unit to unit, such as manipulators.
- Recurrent state might substitute for separate adaptive controllers in many robotic tasks.
- Expanding the simulation distribution to include more environmental factors would test broader generalization.
Load-bearing premise
The 1000 sampled quadrotors in simulation capture enough real-world variation in motor response, frame flexibility, propeller aerodynamics, and controller latency for the distilled policy to transfer directly.
What would settle it
Fly a real quadrotor whose motor curves, frame stiffness, or latency fall outside the range represented in the 1000 simulated samples. If the premise is load-bearing, the policy should lose stability or fail at trajectory tracking on that platform.
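One way to make this test concrete is to compare a platform's measured parameters against the ranges sampled in simulation. The sketch below is hypothetical: only the upper thrust-to-weight bound of 5 is taken from the paper's text quoted further down this page (which also mentions a real platform with a ratio of about 12); the lower bound, the mass range, and all names are placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Range:
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high

# Only the upper thrust-to-weight bound (5) comes from the paper's text;
# every other number here is a placeholder.
TRAINING_RANGES = {
    "thrust_to_weight": Range(1.5, 5.0),
    "mass_kg": Range(0.02, 5.0),
}

def out_of_distribution(platform: dict) -> list:
    """Return the names of parameters that fall outside the training ranges."""
    return [
        name
        for name, allowed in TRAINING_RANGES.items()
        if name in platform and not allowed.contains(platform[name])
    ]

# Hypothetical platform with a very high thrust-to-weight ratio.
racer = {"thrust_to_weight": 12.0, "mass_kg": 0.5}
print(out_of_distribution(racer))  # ['thrust_to_weight']
```

A platform that returns a non-empty list is a candidate for the extrapolation test described above.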
Original abstract
Humans are remarkably data-efficient when adapting to new unseen conditions, like driving a new car. In contrast, modern robotic control systems, like neural network policies trained using Reinforcement Learning (RL), are highly specialized for single environments. Because of this overfitting, they are known to break down even under small differences like the Simulation-to-Reality (Sim2Real) gap and require system identification and retraining for even minimal changes to the system. In this work, we present RAPTOR, a method for training a highly adaptive foundation policy for quadrotor control. Our method enables training a single, end-to-end neural-network policy to control a wide variety of quadrotors. We test 10 different real quadrotors from 32 g to 2.4 kg that also differ in motor type (brushed vs. brushless), frame type (soft vs. rigid), propeller type (2/3/4-blade), and flight controller (PX4/Betaflight/Crazyflie/M5StampFly). We find that a tiny, three-layer policy with only 2084 parameters is sufficient for zero-shot adaptation to a wide variety of platforms. The adaptation through in-context learning is made possible by using a recurrence in the hidden layer. The policy is trained through our proposed Meta-Imitation Learning algorithm, where we sample 1000 quadrotors and train a teacher policy for each of them using RL. Subsequently, the 1000 teachers are distilled into a single, adaptive student policy. We find that within milliseconds, the resulting foundation policy adapts zero-shot to unseen quadrotors. We extensively test the capabilities of the foundation policy under numerous conditions (trajectory tracking, indoor/outdoor, wind disturbance, poking, different propellers).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RAPTOR, a method for training a small recurrent neural network policy (three layers, 2084 parameters) for quadrotor control using meta-imitation learning. RL teacher policies are trained for each of 1000 sampled simulated quadrotors and distilled into a single adaptive student policy. This policy is said to adapt zero-shot to 10 real quadrotors that vary in mass from 32 g to 2.4 kg, motor type (brushed/brushless), frame type (soft/rigid), propeller type (2/3/4-blade), and flight controller (PX4/Betaflight/Crazyflie/M5StampFly). The adaptation is attributed to the recurrent hidden state, which allows in-context learning, and the policy is tested on trajectory tracking, indoor/outdoor flight, wind disturbance, poking, and different propellers.
Significance. If the results hold, the work would be significant for demonstrating that a compact foundation policy can achieve broad zero-shot generalization across diverse real-world quadrotor hardware without retraining or system identification. This could reduce the engineering effort for deploying control policies on new platforms. The small parameter count is a strength, and the use of recurrence for adaptation is an interesting approach. The extensive testing on multiple real platforms under varied conditions provides a good starting point for validation, though quantitative details are needed to fully assess impact.
major comments (2)
- The abstract states successful zero-shot tests on 10 real platforms under varied conditions, but provides no quantitative metrics, baselines, error bars, or ablation results. This is load-bearing for the central claim of effective adaptation, as qualitative success alone does not substantiate the performance of the 2084-parameter policy.
- The sampling procedure for the 1000 quadrotors lacks specification of the parameter ranges and variance for dynamics properties like motor thrust curves, frame flexibility, propeller aerodynamics, and flight-controller latency. This information is necessary to evaluate whether the real-world test platforms represent genuine extrapolation beyond the simulated distribution.
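To illustrate what such a specification could look like, the following sketch samples simulated quadrotors from explicitly declared ranges. All parameter names and bounds are hypothetical placeholders; the source does not give the authors' distributions.

```python
import numpy as np

# Hypothetical sampling ranges (low, high); placeholders, not the paper's values.
RANGES = {
    "mass_kg": (0.03, 2.5),
    "thrust_to_weight": (1.5, 5.0),
    "arm_length_m": (0.03, 0.25),
    "motor_time_constant_s": (0.02, 0.12),
}

def sample_quadrotors(n, seed=0):
    """Draw n parameter sets; mass is log-uniform, the rest are uniform."""
    rng = np.random.default_rng(seed)
    samples = {}
    for name, (low, high) in RANGES.items():
        if name == "mass_kg":
            # Log-uniform so each order of magnitude of mass is sampled equally often.
            samples[name] = np.exp(rng.uniform(np.log(low), np.log(high), size=n))
        else:
            samples[name] = rng.uniform(low, high, size=n)
    return samples

fleet = sample_quadrotors(1000)
for name, values in fleet.items():
    print(f"{name}: min={values.min():.3f} max={values.max():.3f}")
```

Publishing a table equivalent to `RANGES` would let a reader check directly whether each real platform sits inside or outside the training distribution.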
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript. We address each major comment point by point below, providing the strongest honest defense of the work while acknowledging where revisions are warranted to strengthen the presentation.
-
Referee: The abstract states successful zero-shot tests on 10 real platforms under varied conditions, but provides no quantitative metrics, baselines, error bars, or ablation results. This is load-bearing for the central claim of effective adaptation, as qualitative success alone does not substantiate the performance of the 2084-parameter policy.
Authors: We agree that quantitative support is essential for the central claims. The full manuscript reports quantitative results including position and attitude tracking RMSE, success rates over repeated trials, and comparisons against non-recurrent baselines and ablations removing the recurrent state. To directly address the concern, we will revise the abstract to include key quantitative highlights (e.g., typical tracking errors and adaptation timescales) and ensure error bars from multiple runs plus ablation tables are clearly presented and referenced in the main text.
Revision: yes
-
Referee: The sampling procedure for the 1000 quadrotors lacks specification of the parameter ranges and variance for dynamics properties like motor thrust curves, frame flexibility, propeller aerodynamics, and flight-controller latency. This information is necessary to evaluate whether the real-world test platforms represent genuine extrapolation beyond the simulated distribution.
Authors: We acknowledge that the submitted version does not explicitly tabulate the full sampling ranges and variances in the main text. The manuscript describes sampling 1000 quadrotors but leaves the precise distributions for thrust curves, frame stiffness, propeller aerodynamics, and latency implicit. We will add a dedicated paragraph and summary table in the methods section (and expand the appendix) that specifies the uniform and Gaussian ranges used for each property. This revision will allow readers to assess how the real platforms relate to the training distribution.
Revision: yes
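For readers unfamiliar with the metric named in the first response, position tracking RMSE with a spread over repeated trials could be computed roughly as follows. The trajectories are synthetic and the function name is an assumption; none of the numbers come from the paper.

```python
import numpy as np

def position_rmse(actual, reference):
    """RMSE of the Euclidean position error over one trajectory (N x 3 arrays)."""
    error = np.linalg.norm(np.asarray(actual) - np.asarray(reference), axis=1)
    return float(np.sqrt(np.mean(error ** 2)))

rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 200)
# Reference: a unit circle flown at 1 m height.
reference = np.stack([np.cos(t), np.sin(t), np.ones_like(t)], axis=1)

# Five synthetic trials: reference plus Gaussian noise standing in for tracking error.
rmses = [
    position_rmse(reference + rng.normal(0.0, 0.05, reference.shape), reference)
    for _ in range(5)
]
print(f"RMSE = {np.mean(rmses):.3f} ± {np.std(rmses, ddof=1):.3f} m over {len(rmses)} trials")
```

Reporting the mean together with the sample standard deviation across trials is one simple way to provide the error bars the referee asks for.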
Circularity Check
No significant circularity in the meta-imitation learning pipeline
The paper describes an empirical training procedure in which 1000 simulated quadrotors are each assigned an independent RL teacher policy, after which the teachers are distilled into a single recurrent student policy whose hidden state enables in-context adaptation. This pipeline does not contain any self-definitional equations, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the reported zero-shot performance on real platforms back to the input sampling distribution by construction. The central result is an experimental outcome measured on ten distinct physical vehicles whose dynamics lie outside the training set, making the derivation self-contained against external benchmarks rather than tautological.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of sampled quadrotors
- policy architecture size
axioms (1)
- Domain assumption: quadrotor dynamics in simulation are sufficiently accurate to produce teachers whose behavior transfers to real hardware via distillation.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Meta-Imitation Learning algorithm... sample 1000 quadrotors and train a teacher policy for each... distill into a single adaptive student policy... recurrence in the hidden layer
-
IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
emergent implicit system identification... thrust-to-weight ratio... linear probe... R² of 0.949
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Adaptive Outer-Loop Control of Quadrotors via Reinforcement Learning
An RL-based outer-loop quadrotor controller augmented with an online Residual Dynamics Predictor for disturbance estimation and a data-efficient sim-to-real calibration bridge.
-
Autonomous UAV Pipeline Near-proximity Inspection via Disturbance-Aware Predictive Visual Servoing
The ESKF-PRE-VMPC framework couples quadrotor dynamics with image-feature prediction and disturbance estimation to enable autonomous near-proximity pipeline inspection that outperforms baselines in straight, windy, an...
Reference graph
Works this paper leans on
-
[1]
What size (number of parameters) does the recurrent neural network policy require to express this behavior? Can it run in hard real-time at high frequencies when deployed on small microcontrollers?
-
[2]
What context window is feasible? Recurrent neural networks are notoriously hard to train for sequences longer than 100–200 steps. Will the policy forget the system dynamics after a short time?
-
[3]
Does the policy generalize to unseen quadrotors that are 1) in-distribution and 2) out-of-distribution?
-
[4]
How much time is required from activating the policy until it has gathered enough information to stably control the quadrotor? Is this feasible mid-flight, or would the quadrotor crash before the policy has identified the system properly?
-
[5]
Is there a trade-off between agility and adaptability? We tackle the question of feasibility 1) by devising a method to train such a foundation policy for quadrotor control, implementing it, and testing it on a range of real-world quadrotors. We tackle the size and inference speed question 2) by studying the scaling laws (13) in the student policy and by ...
-
[6]
A quantitatively wide range of parameters:
• Weight: 31.9 g - 2.4 kg
• Size: 65 mm - 500 mm
• Thrust-to-weight: ≈1.75 - 12
-
[7]
A qualitatively diverse set of features:
• Flight controller: PX4, Betaflight, Crazyflie, M5StampFly
• State estimator: EKF, Mahony, Madgwick
• Motor type: brushed and brushless
• Flexible frame
• Mixing two- and three-blade propellers
Many of these quantities are (far) out-of-distribution, like a thrust-to-weight ratio of 12 (≤5 in training), a flexible fr...
-
[8]
Switching from TD3 to SAC because we observed slightly more robust training dynamics in SAC
-
[9]
Training for longer to ensure convergence for all quadrotors
-
[10]
Adjusting the reward function, adding a penalty for termination and for the action derivative.
-
[11]
Removing the curriculum because we found that the changes to the reward function stabilize the training without the need for a curriculum
-
[12]
Ground-truth motor RPM states. The teacher policies are never deployed in reality, so instead of feeding a proprioceptive action history to account for the unobservable motor states as in (7), the teachers can directly observe the ground-truth motor states. This also makes the actor-critic architecture symmetric. We do these modifications to trade off wal...
-
[13]
Due to the variable number of past steps (cf. Figure 7A), we design a Gated Recurrent Unit (GRU) (47)-based foundation policy architecture as displayed in Figure 7B. The relatively small hidden dimensionality of 16 is justified by the scaling experiments in Section 2.3. Due to the recurrence, the policy can theoretically "access" all the previous observat...
-
[14]
G. Li, X. Liu, G. Loianno, Human-Aware Physical Human–Robot Collaborative Transportation and Manipulation With Multiple Aerial Robots. IEEE Transactions on Robotics 41, 762–781 (2025), doi:10.1109/TRO.2024.3502508
-
[15]
A. Ollero, et al., The AEROARMS Project: Aerial Robots with Advanced Manipulation Capabilities for Inspection and Maintenance. IEEE Robotics and Automation Magazine 25(4), 12–23 (2018), doi:10.1109/MRA.2018.2852789
-
[16]
M. Tranzatto, et al., CERBERUS in the DARPA Subterranean Challenge. Science Robotics 7(66), eabp9742 (2022), doi:10.1126/scirobotics.abp9742, https://www.science.org/doi/abs/10.1126/scirobotics.abp9742
-
[17]
Y. Song, A. Romero, M. Müller, V. Koltun, D. Scaramuzza, Reaching the limit in autonomous racing: Optimal control versus reinforcement learning. Science Robotics 8(82), eadg1462 (2023), doi:10.1126/scirobotics.adg1462, https://www.science.org/doi/abs/10.1126/scirobotics.adg1462
-
[18]
E. Kaufmann, et al., Champion-level drone racing using deep reinforcement learning. Nature 620(7976), 982–987 (2023), doi:10.1038/s41586-023-06419-4
- [19]
-
[20]
J. Eschmann, D. Albani, G. Loianno, Learning to Fly in Seconds. IEEE Robotics and Automation Letters 9(7), 6336–6343 (2024), doi:10.1109/LRA.2024.3396025
-
[21]
X. B. Peng, M. Andrychowicz, W. Zaremba, P. Abbeel, Sim-to-Real Transfer of Robotic Control with Dynamics Randomization, in IEEE International Conference on Robotics and Automation (ICRA) (2018), pp. 3803–3810, doi:10.1109/ICRA.2018.8460528
-
[22]
A. Loquercio, et al., Deep drone racing: From simulation to reality with domain randomization. IEEE Transactions on Robotics 36(1), 1–14 (2019)
-
[23]
D. Hanover, et al., Autonomous drone racing: A survey. IEEE Transactions on Robotics 40, 3044–3067 (2024)
-
[24]
A. Radford, et al., Learning Transferable Visual Models From Natural Language Supervision, in Proceedings of the 38th International Conference on Machine Learning, M. Meila, T. Zhang, Eds. (PMLR), vol. 139 of Proceedings of Machine Learning Research (2021), pp. 8748–8763, https://proceedings.mlr.press/v139/radford21a.html
-
[25]
T. Brown, et al., Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
-
[26]
J. Kaplan, et al., Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)
-
[27]
E. Kaufmann, L. Bauersfeld, D. Scaramuzza, A Benchmark Comparison of Learned Control Policies for Agile Quadrotor Flight, in International Conference on Robotics and Automation (ICRA) (2022), pp. 10504–10510, doi:10.1109/ICRA46639.2022.9811564
- [28]
- [29]
-
[30]
J. Xing, I. Geles, Y. Song, E. Aljalbout, D. Scaramuzza, Multi-task reinforcement learning for quadrotors. IEEE Robotics and Automation Letters (2024)
-
[31]
S. Gronauer, M. Kissel, L. Sacchetto, M. Korte, K. Diepold, Using simulation optimization to improve zero-shot policy transfer of quadrotors, in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE) (2022), pp. 10170–10176
- [32]
- [33]
-
[34]
L. Balandi, P. Robuffo Giordano, M. Tognon, Acceleration-Based Inner-Loop Control and MPC for Aerial Robots: Advantages and Drawbacks, in European Robotics Forum (Springer) (2025), pp. 75–80
- [35]
-
[36]
D. Zhang, et al., A Learning-Based Quadcopter Controller With Extreme Adaptation. IEEE Transactions on Robotics 41, 3948–3964 (2025), doi:10.1109/TRO.2025.3577037
-
[37]
P. Henderson, et al., Deep reinforcement learning that matters, in Proceedings of the AAAI conference on artificial intelligence, vol. 32 (2018)
-
[38]
Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, J. Ba, Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. Advances in neural information processing systems 30 (2017)
-
[39]
H. P. Van Hasselt, A. Guez, M. Hessel, V. Mnih, D. Silver, Learning values across many orders of magnitude. Advances in neural information processing systems 29 (2016)
- [40]
-
[41]
K. Clary, E. Tosch, J. Foley, D. Jensen, Let's play again: Variability of deep reinforcement learning agents in atari environments. arXiv preprint arXiv:1904.06312 (2019)
-
[42]
R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, M. Bellemare, Deep reinforcement learning at the edge of the statistical precipice. Advances in neural information processing systems 34, 29304–29320 (2021)
-
[43]
T. Baca, et al., The MRS UAV system: Pushing the frontiers of reproducible research, real-world deployment, and education with autonomous unmanned aerial vehicles. Journal of Intelligent & Robotic Systems 102(1), 26 (2021)
-
[44]
J. Eschmann, D. Albani, G. Loianno, Data-Driven System Identification of Quadrotors Subject to Motor Delays, in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2024), pp. 8095–8102, doi:10.1109/IROS58592.2024.10801441
-
[45]
G. Alain, Y. Bengio, Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644 (2016)
-
[46]
A. Dosovitskiy, et al., An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
-
[47]
M. Oquab, et al., DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
- [48]
-
[49]
S. O. Madgwick, et al., An efficient orientation filter for inertial and inertial/magnetic sensor arrays (2010)
-
[50]
Materials and methods are available as supplementary material
-
[51]
P. Kunapuli, J. Welde, D. Jayaraman, V. Kumar, Leveling the Playing Field: Carefully Comparing Classical and Learned Controllers for Quadrotor Trajectory Tracking, in Proceedings of Robotics: Science and Systems (Los Angeles, United States of America) (2025)
-
[52]
S. Ross, B. Chaib-draa, J. Pineau, Bayes-adaptive pomdps. Advances in neural information processing systems 20 (2007)
- [53]
-
[54]
J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA) (1988)
-
[55]
OpenAI, et al., Solving Rubik's Cube with a Robot Hand (2019)
-
[56]
J. X. Wang, et al., Learning to reinforcement learn. arXiv preprint arXiv:1611.05763 (2016)
-
[57]
Y. Duan, et al., RL$^2$: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779 (2016)
-
[58]
J. Achiam, et al., GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
- [59]
-
[60]
K. Cho, et al., Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
-
[61]
S. Ross, G. Gordon, D. Bagnell, A reduction of imitation learning and structured prediction to no-regret online learning, in Proceedings of the fourteenth international conference on artificial intelligence and statistics (JMLR Workshop and Conference Proceedings) (2011), pp. 627–635
-
[62]
C. Moler, Matrix computation on distributed memory multiprocessors. Hypercube Multiprocessors 86(181-195), 31 (1986)
-
[64]
Supplementary Code and Data Repository, GitHub: rl-tools/raptor, https://github.com/rl-tools/raptor
-
[65]
Project Page, Static Website, https://raptor.rl.tools/
-
[66]
Supplementary Video, YouTube, https://youtu.be/hVzdWRFTX3k
-
[67]
S. Sarkka, A. Solin, J. Hartikainen, Spatiotemporal Learning via Infinite-Dimensional Bayesian Filtering and Smoothing: A Look at Gaussian Process Regression Through Kalman Filtering. IEEE Signal Processing Magazine 30(4), 51–61 (2013), doi:10.1109/MSP.2013.2246292
-
[68]
S. Särkkä, A. Solin, Applied stochastic differential equations, vol. 10 (Cambridge University Press) (2019)
-
[69]
C. Berner, et al., Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680 (2019)
Acknowledgments: We thank professor Van Anh Ho and Quang Ngoc Pham for letting us test the foundation policy on the soft quadrotor. Funding: This work was supported in part by the National Science Foundation (NSF) CAREER program under Grant 2145277, ...
doi:10.5281/zenodo.17096679
-
[70]
Figure 7B shows the architecture, which contains a dense input layer, a Gated Recurrent Unit (GRU) layer, and a dense output layer. The initialization of the recurrent state (value at step 0) is all zeros and we feed back the previous output of the policy as an input. Due to the small hidden dimensions, the foundation policy only has 2084 parameters: 𝑃=𝑃 in...
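The excerpt truncates the authors' parameter formula, so the following is only a sketch of how such a count can be assembled for a dense-GRU-dense policy. The hidden size of 16 comes from the text above. The four outputs (one per motor), the PyTorch-style GRU convention with two bias vectors per gate, and the input width of 23 are all assumptions. The width of 23 is a guess chosen because it reproduces the quoted total under this convention; a different bias convention would imply a different width.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def gru_params(n_in, n_hidden):
    """Three gates, each with input weights, recurrent weights, and two bias vectors
    (the PyTorch-style convention; assumed here, not confirmed by the source)."""
    return 3 * (n_in * n_hidden + n_hidden * n_hidden + 2 * n_hidden)

def policy_params(n_obs, n_hidden, n_act):
    """Dense input layer -> GRU layer -> dense output layer."""
    return (
        dense_params(n_obs, n_hidden)
        + gru_params(n_hidden, n_hidden)
        + dense_params(n_hidden, n_act)
    )

# n_obs=23 and n_act=4 are assumptions; n_hidden=16 is from the quoted text.
print(policy_params(n_obs=23, n_hidden=16, n_act=4))  # 384 + 1632 + 68 = 2084
```

Under these assumptions the GRU layer accounts for roughly three quarters of the parameters, which is consistent with the recurrence being the main carrier of the policy's capacity.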