
arXiv: 2509.11481 · v2 · submitted 2025-09-15 · 💻 cs.RO · cs.AI · cs.LG

RAPTOR: A Foundation Policy for Quadrotor Control

Pith reviewed 2026-05-18 17:23 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords quadrotor control · zero-shot adaptation · meta-imitation learning · recurrent policy · reinforcement learning · sim-to-real transfer · foundation policy · in-context learning

The pith

A tiny recurrent policy adapts zero-shot to many different quadrotors

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a single three-layer neural network with 2084 parameters can control quadrotors across wide hardware differences by learning to adapt from its own recent history. Training proceeds by sampling 1000 varied quadrotors in simulation, training a separate reinforcement learning teacher for each, and distilling all teachers into one student policy whose recurrence supports rapid in-context adjustment. This approach targets the brittleness of current robotic controllers that require system identification and retraining for even minor platform changes. If the claim holds, one policy could handle trajectory tracking, wind, and disturbances on new real quadrotors from 32 g to 2.4 kg with different motors, frames, propellers, and flight controllers.

Core claim

RAPTOR trains a foundation policy for quadrotor control by first creating 1000 specialized teacher policies through reinforcement learning on distinct simulated platforms, then distilling them into one recurrent student policy. The student uses its hidden-layer recurrence to adapt its behavior within milliseconds to unseen real quadrotors, achieving zero-shot transfer without online adaptation or system identification.
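The teacher-to-student distillation described above can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes a one-dimensional system `x' = x + DT * u / mass`, replaces each RL teacher with a hand-written controller that knows its platform's mass, and replaces the recurrent student with a single weight on a hand-crafted history feature. `DT`, `K`, the mass range, and all function names are hypothetical. Only the sample size of 1000 platforms comes from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
DT, K = 0.1, 2.0  # hypothetical time step and teacher gain

def teacher_action(mass, x):
    """Stand-in for an RL teacher that knows its own platform's mass."""
    return -K * mass * x

def rollout(mass, x0, steps=20):
    """Run one teacher on x' = x + DT * u / mass and log student training pairs."""
    feats, targets = [], []
    x_prev, u_prev = x0, teacher_action(mass, x0)
    x = x_prev + DT * u_prev / mass
    for _ in range(steps):
        u = teacher_action(mass, x)
        # The student never sees `mass`; it only sees the last transition.
        feats.append(u_prev * x / (x - x_prev))
        targets.append(u)
        x_prev, u_prev = x, u
        x = x + DT * u / mass
    return feats, targets

# Sample many platforms, one teacher each, and pool all demonstrations.
masses = rng.uniform(0.03, 2.4, size=1000)  # hypothetical range in kg
X, Y = [], []
for m in masses:
    f, t = rollout(m, x0=rng.uniform(0.5, 2.0))
    X += f
    Y += t
X, Y = np.array(X), np.array(Y)

# Distill: fit one student weight by least squares on the pooled data.
w = float(X @ Y / (X @ X))

def student_action(x, x_prev, u_prev):
    """Single student policy: infers the platform from history, not from `mass`."""
    return w * u_prev * x / (x - x_prev)
```

In this toy setting the student reproduces every teacher because the last transition reveals the mass implicitly. The paper's student relies on a learned recurrent state rather than a hand-picked feature.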

What carries the argument

The recurrent hidden layer in the policy network that maintains internal state to support in-context learning from recent observations and actions.
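To make the role of the recurrent state concrete, here is a minimal sketch of a dense-GRU-dense policy step in NumPy. It uses the standard GRU gate equations with random, untrained weights. The observation width of 18, the `tanh` activations on the input and output layers, the weight scale, and the class name `TinyRecurrentPolicy` are assumptions for illustration. The hidden size of 16, the zero initial state, and the feedback of the previous action follow the paper excerpts quoted in the reference graph.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyRecurrentPolicy:
    """Dense input layer -> GRU cell -> dense output layer, random weights."""

    def __init__(self, obs_dim, act_dim=4, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = obs_dim + act_dim  # observation plus fed-back previous action
        self.W_in = rng.normal(0.0, 0.3, (hidden, in_dim))
        self.b_in = np.zeros(hidden)
        # Index 0: update gate, 1: reset gate, 2: candidate state.
        self.W = rng.normal(0.0, 0.3, (3, hidden, hidden))
        self.U = rng.normal(0.0, 0.3, (3, hidden, hidden))
        self.b = np.zeros((3, hidden))
        self.W_out = rng.normal(0.0, 0.3, (act_dim, hidden))
        self.b_out = np.zeros(act_dim)
        self.hidden, self.act_dim = hidden, act_dim
        self.reset()

    def reset(self):
        # Recurrent state and previous action both start at zero.
        self.h = np.zeros(self.hidden)
        self.prev_action = np.zeros(self.act_dim)

    def step(self, obs):
        x = np.tanh(self.W_in @ np.concatenate([obs, self.prev_action]) + self.b_in)
        z = sigmoid(self.W[0] @ x + self.U[0] @ self.h + self.b[0])
        r = sigmoid(self.W[1] @ x + self.U[1] @ self.h + self.b[1])
        cand = np.tanh(self.W[2] @ x + self.U[2] @ (r * self.h) + self.b[2])
        self.h = z * self.h + (1.0 - z) * cand  # carry information across steps
        self.prev_action = np.tanh(self.W_out @ self.h + self.b_out)
        return self.prev_action
```

Feeding the same observation twice yields two different actions, because the second call sees a non-zero hidden state and previous action. That dependence on history is what allows in-context adaptation once the weights are trained.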

If this is right

  • The same policy performs trajectory tracking on all tested real platforms without fine-tuning.
  • It maintains control under wind disturbances and physical poking.
  • Performance holds for both indoor and outdoor flights.
  • Adaptation completes in milliseconds, supporting real-time use across hardware types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation approach could produce foundation policies for other robots whose hardware varies between units, such as manipulators.
  • Recurrent state might substitute for separate adaptive controllers in many robotic tasks.
  • Expanding the simulation distribution to include more environmental factors would test broader generalization.

Load-bearing premise

The 1000 sampled quadrotors in simulation capture enough real-world variation in motor response, frame flexibility, propeller aerodynamics, and controller latency for the distilled policy to transfer directly.

What would settle it

A flight test on a real quadrotor whose motor curves, frame stiffness, or latency fall outside the range represented in the 1000 simulated samples. If the policy loses stability or fails at trajectory tracking there, the premise above does not hold.
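Such a test first requires deciding whether a platform lies outside the training distribution at all. The sketch below shows one simple way to flag out-of-range parameters. The upper bound of 5 on thrust-to-weight ratio comes from a paper excerpt quoted in the reference graph, which also reports a tested platform with a ratio of 12. Every other bound, the parameter names, and the helper `out_of_distribution` are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Range:
    low: float
    high: float

    def contains(self, value):
        return self.low <= value <= self.high

# Only the thrust-to-weight upper bound (5) is taken from the source;
# all other numbers here are HYPOTHETICAL placeholders.
TRAINING_RANGES = {
    "thrust_to_weight": Range(1.5, 5.0),
    "mass_kg": Range(0.02, 5.0),
}

def out_of_distribution(platform):
    """Return the sorted names of parameters outside the training ranges."""
    return sorted(
        name
        for name, rng in TRAINING_RANGES.items()
        if not rng.contains(platform[name])
    )

# A hypothetical agile platform with thrust-to-weight 12 is flagged.
flagged = out_of_distribution({"thrust_to_weight": 12.0, "mass_kg": 0.5})
```

A per-parameter box check like this ignores correlations between parameters, so it is only a first filter.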

Figures

Figures reproduced from arXiv: 2509.11481 by Dario Albani, Giuseppe Loianno, Jonas Eschmann.

Figure 3. Here we start a quadrotor in a state where it is displaced by 2 m from the target position in [...]

Original abstract

Humans are remarkably data-efficient when adapting to new unseen conditions, like driving a new car. In contrast, modern robotic control systems, like neural network policies trained using Reinforcement Learning (RL), are highly specialized for single environments. Because of this overfitting, they are known to break down even under small differences like the Simulation-to-Reality (Sim2Real) gap and require system identification and retraining for even minimal changes to the system. In this work, we present RAPTOR, a method for training a highly adaptive foundation policy for quadrotor control. Our method enables training a single, end-to-end neural-network policy to control a wide variety of quadrotors. We test 10 different real quadrotors from 32 g to 2.4 kg that also differ in motor type (brushed vs. brushless), frame type (soft vs. rigid), propeller type (2/3/4-blade), and flight controller (PX4/Betaflight/Crazyflie/M5StampFly). We find that a tiny, three-layer policy with only 2084 parameters is sufficient for zero-shot adaptation to a wide variety of platforms. The adaptation through in-context learning is made possible by using a recurrence in the hidden layer. The policy is trained through our proposed Meta-Imitation Learning algorithm, where we sample 1000 quadrotors and train a teacher policy for each of them using RL. Subsequently, the 1000 teachers are distilled into a single, adaptive student policy. We find that within milliseconds, the resulting foundation policy adapts zero-shot to unseen quadrotors. We extensively test the capabilities of the foundation policy under numerous conditions (trajectory tracking, indoor/outdoor, wind disturbance, poking, different propellers).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces RAPTOR, a method for training a small recurrent neural network policy (three layers, 2084 parameters) for quadrotor control using meta-imitation learning. RL teacher policies are trained for each of 1000 sampled simulated quadrotors and distilled into a single adaptive student policy. This policy is said to enable zero-shot adaptation to 10 real quadrotors varying in mass from 32 g to 2.4 kg, motor type (brushed/brushless), frame type (soft/rigid), propeller type (2/3/4-blade), and flight controller (PX4/Betaflight/Crazyflie/M5StampFly). The adaptation is attributed to the recurrent hidden state, which allows in-context learning, and the policy is tested on trajectory tracking, indoor and outdoor flight, wind disturbance, poking, and different propellers.

Significance. If the results hold, the work would be significant for demonstrating that a compact foundation policy can achieve broad zero-shot generalization across diverse real-world quadrotor hardware without retraining or system identification. This could reduce the engineering effort for deploying control policies on new platforms. The small parameter count is a strength, and the use of recurrence for adaptation is an interesting approach. The extensive testing on multiple real platforms under varied conditions provides a good starting point for validation, though quantitative details are needed to fully assess impact.

major comments (2)
  1. The abstract states successful zero-shot tests on 10 real platforms under varied conditions, but provides no quantitative metrics, baselines, error bars, or ablation results. This is load-bearing for the central claim of effective adaptation, as qualitative success alone does not substantiate the performance of the 2084-parameter policy.
  2. The sampling procedure for the 1000 quadrotors lacks specification of the parameter ranges and variance for dynamics properties like motor thrust curves, frame flexibility, propeller aerodynamics, and flight-controller latency. This information is necessary to evaluate whether the real-world test platforms represent genuine extrapolation beyond the simulated distribution.
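The quantitative reporting requested in comment 1 could take the following shape. This is a minimal sketch, assuming each trial is logged as a pair of `(N, 3)` position arrays in metres. The function names and the choice of a sample standard deviation as the error bar are illustrative assumptions, not the paper's evaluation protocol.

```python
import numpy as np

def position_rmse(actual, reference):
    """Root-mean-square Euclidean position error over one trajectory."""
    actual = np.asarray(actual, dtype=float)
    reference = np.asarray(reference, dtype=float)
    err = np.linalg.norm(actual - reference, axis=1)  # per-step distance
    return float(np.sqrt(np.mean(err ** 2)))

def summarize_trials(trials):
    """Mean and sample standard deviation of RMSE across repeated trials.

    `trials` is a list of (actual, reference) pairs; needs at least two.
    """
    rmses = np.array([position_rmse(a, r) for a, r in trials])
    return float(rmses.mean()), float(rmses.std(ddof=1))
```

Reporting the pair returned by `summarize_trials` per platform, next to the same numbers for a non-recurrent baseline, would give the error bars and the comparison the comment asks for.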

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address each major comment point by point below, providing the strongest honest defense of the work while acknowledging where revisions are warranted to strengthen the presentation.

  1. Referee: The abstract states successful zero-shot tests on 10 real platforms under varied conditions, but provides no quantitative metrics, baselines, error bars, or ablation results. This is load-bearing for the central claim of effective adaptation, as qualitative success alone does not substantiate the performance of the 2084-parameter policy.

    Authors: We agree that quantitative support is essential for the central claims. The full manuscript reports quantitative results including position and attitude tracking RMSE, success rates over repeated trials, and comparisons against non-recurrent baselines and ablations removing the recurrent state. To directly address the concern, we will revise the abstract to include key quantitative highlights (e.g., typical tracking errors and adaptation timescales) and ensure error bars from multiple runs plus ablation tables are clearly presented and referenced in the main text. revision: yes

  2. Referee: The sampling procedure for the 1000 quadrotors lacks specification of the parameter ranges and variance for dynamics properties like motor thrust curves, frame flexibility, propeller aerodynamics, and flight-controller latency. This information is necessary to evaluate whether the real-world test platforms represent genuine extrapolation beyond the simulated distribution.

    Authors: We acknowledge that the submitted version does not explicitly tabulate the full sampling ranges and variances in the main text. The manuscript describes sampling 1000 quadrotors but leaves the precise distributions for thrust curves, frame stiffness, propeller aerodynamics, and latency implicit. We will add a dedicated paragraph and summary table in the methods section (and expand the appendix) that specifies the uniform and Gaussian ranges used for each property. This revision will allow readers to assess how the real platforms relate to the training distribution. revision: yes
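The promised table of uniform and Gaussian ranges could be generated directly from the sampling specification. The sketch below assumes a small declarative spec. The three parameter names and every number in `SPEC` are hypothetical, since the page states that the actual distributions are not given. Only the sample size of 1000 comes from the source.

```python
import numpy as np

# HYPOTHETICAL specification: ("uniform", low, high) or ("gaussian", mean, std).
SPEC = {
    "mass_kg": ("uniform", 0.02, 5.0),
    "thrust_to_weight": ("uniform", 1.5, 5.0),
    "motor_time_constant_s": ("gaussian", 0.05, 0.01),
}

def sample_platforms(n, seed=0):
    """Draw n values for every parameter in SPEC."""
    rng = np.random.default_rng(seed)
    samples = {}
    for name, (kind, a, b) in SPEC.items():
        if kind == "uniform":
            samples[name] = rng.uniform(a, b, size=n)
        else:
            samples[name] = rng.normal(a, b, size=n)
    return samples

def summary_rows(samples):
    """One (name, min, max, mean, std) row per parameter for a methods table."""
    return [
        (name, float(v.min()), float(v.max()), float(v.mean()), float(v.std()))
        for name, v in samples.items()
    ]

rows = summary_rows(sample_platforms(1000))
```

Publishing both the spec and the empirical summary lets a reader place each real platform relative to the training distribution.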

Circularity Check

0 steps flagged

No significant circularity in the meta-imitation learning pipeline


The paper describes an empirical training procedure in which 1000 simulated quadrotors are each assigned an independent RL teacher policy, after which the teachers are distilled into a single recurrent student policy whose hidden state enables in-context adaptation. This pipeline does not contain any self-definitional equations, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the reported zero-shot performance on real platforms back to the input sampling distribution by construction. The central result is an experimental outcome measured on ten distinct physical vehicles whose dynamics lie outside the training set, making the derivation self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method rests on the assumption that simulated quadrotor variations are representative of real hardware differences and that the recurrent hidden state suffices for in-context adaptation without explicit system identification.

free parameters (2)
  • number of sampled quadrotors
    1000 quadrotors are sampled to generate the teacher policies; the exact sampling distribution and parameter ranges are not specified.
  • policy architecture size
    The three-layer network is fixed at 2084 parameters; this size is a modeling choice that enables the reported adaptation.
axioms (1)
  • domain assumption: Quadrotor dynamics in simulation are sufficiently accurate to produce teachers whose behavior transfers to real hardware via distillation.
    The entire meta-training pipeline depends on this transfer assumption.

pith-pipeline@v0.9.0 · 5856 in / 1305 out tokens · 43791 ms · 2026-05-18T17:23:30.123638+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Adaptive Outer-Loop Control of Quadrotors via Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    An RL-based outer-loop quadrotor controller augmented with an online Residual Dynamics Predictor for disturbance estimation and a data-efficient sim-to-real calibration bridge.

  2. Autonomous UAV Pipeline Near-proximity Inspection via Disturbance-Aware Predictive Visual Servoing

    cs.RO 2026-04 unverdicted novelty 5.0

    The ESKF-PRE-VMPC framework couples quadrotor dynamics with image-feature prediction and disturbance estimation to enable autonomous near-proximity pipeline inspection that outperforms baselines in straight, windy, an...

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 2 Pith papers · 10 internal anchors

  1. [1]

    What size (number of parameters) does the recurrent neural network policy require to express this behavior? Can it run in hard real-time at high frequencies when deployed on small microcontrollers?

  2. [2]

    What context window is feasible? Recurrent neural networks are notoriously hard to train for sequences longer than 100–200 steps. Will the policy forget the system dynamics after a short time?

  3. [3]

    Does the policy generalize to unseen quadrotors that are 1) in-distribution and 2) out-of-distribution?

  4. [4]

    How much time is required from activating the policy until it has gathered enough information to stably control the quadrotor? Is this feasible mid-flight, or would the quadrotor crash before the policy has identified the system properly?

  5. [5]

    Is there a trade-off between agility and adaptability? We tackle the question of feasibility 1) by devising a method to train such a foundation policy for quadrotor control, implementing it, and testing it on a range of real-world quadrotors. We tackle the size and inference speed question 2) by studying the scaling laws (13) in the student policy and by ...

  6. [6]

    A quantitatively wide range of parameters:
      • Weight: 31.9 g to 2.4 kg
      • Size: 65 mm to 500 mm
      • Thrust-to-weight: ≈1.75 to 12

  7. [7]

    This shows that our proposed RAPTOR framework actually produces a policy that not only generalizes to quadrotors that are in the training distribution (cf

    A qualitatively diverse set of features:
      • Flight controller: PX4, Betaflight, Crazyflie, M5StampFly
      • State estimator: EKF, Mahony, Madgwick
      • Motor type: brushed and brushless
      • Flexible frame
      • Mixing two- and three-blade propellers
    Many of these quantities are (far) out-of-distribution, like a thrust-to-weight ratio of 12 (≤5 in training), a flexible fr...

  8. [8]

    Switching from TD3 to SAC because we observed slightly more robust training dynamics in SAC

  9. [9]

    Training for longer to ensure convergence for all quadrotors

  10. [10]

    Adjusting the reward function, adding a penalty for termination and for the action derivative.

  11. [11]

    Removing the curriculum because we found that the changes to the reward function stabilize the training without the need for a curriculum

  12. [12]

    Ground-truth motor RPM states. The teacher policies are never deployed in reality, so instead of feeding a proprioceptive action history to account for the unobservable motor states as in (7), the teachers can directly observe the ground-truth motor states. This also makes the actor-critic architecture symmetric. We do these modifications to trade off wal...

  13. [13]

    Due to the variable number of past steps (cf. Figure 7A), we design a Gated Recurrent Unit (GRU) (47)-based foundation policy architecture as displayed in Figure 7B. The relatively small hidden dimensionality of 16 is justified by the scaling experiments in Section 2.3. Due to the recurrence, the policy can theoretically "access" all the previous observat...

  14. [14]

    G. Li, X. Liu, G. Loianno, Human-Aware Physical Human–Robot Collaborative Transportation and Manipulation With Multiple Aerial Robots. IEEE Transactions on Robotics 41, 762–781 (2025), doi:10.1109/TRO.2024.3502508

  15. [15]

    A. Ollero, et al., The AEROARMS Project: Aerial Robots with Advanced Manipulation Capabilities for Inspection and Maintenance. IEEE Robotics and Automation Magazine 25(4), 12–23 (2018), doi:10.1109/MRA.2018.2852789

  16. [16]

    M. Tranzatto, et al., CERBERUS in the DARPA Subterranean Challenge. Science Robotics 7(66), eabp9742 (2022), doi:10.1126/scirobotics.abp9742, https://www.science.org/doi/abs/10.1126/scirobotics.abp9742

  17. [17]

    Y. Song, A. Romero, M. Müller, V. Koltun, D. Scaramuzza, Reaching the limit in autonomous racing: Optimal control versus reinforcement learning. Science Robotics 8(82), eadg1462 (2023), doi:10.1126/scirobotics.adg1462, https://www.science.org/doi/abs/10.1126/scirobotics.adg1462

  18. [18]

    Champion-level drone racing using deep reinforcement learning

    E. Kaufmann, et al., Champion-level drone racing using deep reinforcement learning. Nature 620(7976), 982–987 (2023), doi:10.1038/s41586-023-06419-4

  19. [19]

    R. Ferede, T. Blaha, E. Lucassen, C. De Wagter, G. C. de Croon, One Net to Rule Them All: Domain Randomization in Quadcopter Racing Across Different Platforms. arXiv preprint arXiv:2504.21586 (2025)

  20. [20]

    J. Eschmann, D. Albani, G. Loianno, Learning to Fly in Seconds. IEEE Robotics and Automation Letters 9(7), 6336–6343 (2024), doi:10.1109/LRA.2024.3396025

  21. [21]

    X. B. Peng, M. Andrychowicz, W. Zaremba, P. Abbeel, Sim-to-Real Transfer of Robotic Control with Dynamics Randomization, in IEEE International Conference on Robotics and Automation (ICRA) (2018), pp. 3803–3810, doi:10.1109/ICRA.2018.8460528

  22. [22]

    A. Loquercio, et al., Deep drone racing: From simulation to reality with domain randomization. IEEE Transactions on Robotics 36(1), 1–14 (2019)

  23. [23]

    D. Hanover, et al., Autonomous drone racing: A survey. IEEE Transactions on Robotics 40, 3044–3067 (2024)

  24. [24]

    A. Radford, et al., Learning Transferable Visual Models From Natural Language Supervision, in Proceedings of the 38th International Conference on Machine Learning, M. Meila, T. Zhang, Eds. (PMLR), vol. 139 of Proceedings of Machine Learning Research (2021), pp. 8748–8763, https://proceedings.mlr.press/v139/radford21a.html

  25. [25]

    T. Brown, et al., Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)

  26. [26]

    Scaling Laws for Neural Language Models

    J. Kaplan, et al., Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).

  27. [27]

    E. Kaufmann, L. Bauersfeld, D. Scaramuzza, A Benchmark Comparison of Learned Control Policies for Agile Quadrotor Flight, in International Conference on Robotics and Automation (ICRA) (2022), pp. 10504–10510, doi:10.1109/ICRA46639.2022.9811564

  28. [28]

    R. Zhang, D. Zhang, M. W. Mueller, Proxfly: Robust control for close proximity quadcopter flight via residual reinforcement learning. arXiv preprint arXiv:2409.13193 (2024)

  29. [29]

    J. Heeg, Y. Song, D. Scaramuzza, Learning quadrotor control from visual features using differentiable simulation. arXiv preprint arXiv:2410.15979 (2024)

  30. [30]

    J. Xing, I. Geles, Y. Song, E. Aljalbout, D. Scaramuzza, Multi-task reinforcement learning for quadrotors. IEEE Robotics and Automation Letters (2024)

  31. [31]

    S. Gronauer, M. Kissel, L. Sacchetto, M. Korte, K. Diepold, Using simulation optimization to improve zero-shot policy transfer of quadrotors, in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE) (2022), pp. 10170–10176

  32. [32]

    R. Ferede, G. de Croon, C. De Wagter, D. Izzo, End-to-end neural network based optimal quadcopter control. Robotics and Autonomous Systems 172, 104588 (2024)

  33. [33]

    R. Ferede, C. De Wagter, D. Izzo, G. C. De Croon, End-to-end reinforcement learning for time-optimal quadcopter flight, in 2024 IEEE International Conference on Robotics and Automation (ICRA) (IEEE) (2024), pp. 6172–6177

  34. [34]

    L. Balandi, P. Robuffo Giordano, M. Tognon, Acceleration-Based Inner-Loop Control and MPC for Aerial Robots: Advantages and Drawbacks, in European Robotics Forum (Springer) (2025), pp. 75–80

  35. [35]

    S. M. Hegre, W. Rehberg, M. Kulkarni, K. Alexis, A Neural Network Mode for PX4 on Embedded Flight Controllers. arXiv preprint arXiv:2505.00432 (2025)

  36. [36]

    D. Zhang, et al., A Learning-Based Quadcopter Controller With Extreme Adaptation. IEEE Transactions on Robotics 41, 3948–3964 (2025), doi:10.1109/TRO.2025.3577037

  37. [37]

    P. Henderson, et al., Deep reinforcement learning that matters, in Proceedings of the AAAI conference on artificial intelligence, vol. 32 (2018)

  38. [38]

    Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, J. Ba, Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. Advances in neural information processing systems 30 (2017)

  39. [39]

    H. P. Van Hasselt, A. Guez, M. Hessel, V. Mnih, D. Silver, Learning values across many orders of magnitude. Advances in neural information processing systems 29 (2016)

  40. [40]

    W. C. Lewis II, M. Moll, L. E. Kavraki, How much do unstated problem constraints limit deep robotic reinforcement learning? arXiv preprint arXiv:1909.09282 (2019)

  41. [41]

    Let's Play Again: Variability of Deep Reinforcement Learning Agents in Atari Environments

    K. Clary, E. Tosch, J. Foley, D. Jensen, Let's play again: Variability of deep reinforcement learning agents in atari environments. arXiv preprint arXiv:1904.06312 (2019).

  42. [42]

    R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, M. Bellemare, Deep reinforcement learning at the edge of the statistical precipice. Advances in neural information processing systems 34, 29304–29320 (2021)

  43. [43]

    T. Baca, et al., The MRS UAV system: Pushing the frontiers of reproducible research, real-world deployment, and education with autonomous unmanned aerial vehicles. Journal of Intelligent & Robotic Systems 102(1), 26 (2021)

  44. [44]

    J. Eschmann, D. Albani, G. Loianno, Data-Driven System Identification of Quadrotors Subject to Motor Delays, in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2024), pp. 8095–8102, doi:10.1109/IROS58592.2024.10801441

  45. [45]

    Understanding intermediate layers using linear classifier probes

    G. Alain, Y. Bengio, Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644 (2016)

  46. [46]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, et al., An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  47. [47]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, et al., Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  48. [48]

    R. Mahony, T. Hamel, J.-M. Pflimlin, Nonlinear complementary filters on the special orthogonal group. IEEE Transactions on automatic control 53(5), 1203–1218 (2008)

  49. [49]

    S. O. Madgwick, et al., An efficient orientation filter for inertial and inertial/magnetic sensor arrays (2010)

  50. [50]

    Materials and methods are available as supplementary material

  51. [51]

    P. Kunapuli, J. Welde, D. Jayaraman, V. Kumar, Leveling the Playing Field: Carefully Comparing Classical and Learned Controllers for Quadrotor Trajectory Tracking, in Proceedings of Robotics: Science and Systems (Los Angeles, United States of America) (2025)

  52. [52]

    S. Ross, B. Chaib-draa, J. Pineau, Bayes-adaptive pomdps. Advances in neural information processing systems 20 (2007)

  53. [53]

    D. Koller, N. Friedman, Probabilistic graphical models: principles and techniques (MIT press) (2009)

  54. [54]

    J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA) (1988)

  55. [55]

    OpenAI, et al., Solving Rubik's Cube with a Robot Hand (2019)

  56. [56]

    J. X. Wang, et al., Learning to reinforcement learn. arXiv preprint arXiv:1611.05763 (2016)

  57. [57]

    RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning

    Y. Duan, et al., RL$^2$: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779 (2016)

  58. [58]

    GPT-4 Technical Report

    J. Achiam, et al., Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).

  59. [59]

    M. Belkin, D. Hsu, S. Ma, S. Mandal, Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences 116(32), 15849–15854 (2019)

  60. [60]

    Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

    K. Cho, et al., Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)

  61. [61]

    S. Ross, G. Gordon, D. Bagnell, A reduction of imitation learning and structured prediction to no-regret online learning, in Proceedings of the fourteenth international conference on artificial intelligence and statistics (JMLR Workshop and Conference Proceedings) (2011), pp. 627–635

  62. [62]

    C. Moler, Matrix computation on distributed memory multiprocessors. Hypercube Multiprocessors 86(181-195), 31 (1986)

  63. [64]

    Supplementary Code and Data Repository, Github: rl-tools/raptor, https://github.com/rl-tools/raptor

  64. [65]

    Project Page, Static Website, https://raptor.rl.tools/

  65. [66]

    Supplementary Video, YouTube, https://youtu.be/hVzdWRFTX3k

  66. [67]

    S. Sarkka, A. Solin, J. Hartikainen, Spatiotemporal Learning via Infinite-Dimensional Bayesian Filtering and Smoothing: A Look at Gaussian Process Regression Through Kalman Filtering. IEEE Signal Processing Magazine 30(4), 51–61 (2013), doi:10.1109/MSP.2013.2246292

  67. [68]

    S. Särkkä, A. Solin, Applied stochastic differential equations, vol. 10 (Cambridge University Press) (2019)

  68. [69]

    Dota 2 with Large Scale Deep Reinforcement Learning

    C. Berner, et al., Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680 (2019).

    Acknowledgments: We thank professor Van Anh Ho and Quang Ngoc Pham for letting us test the foundation policy on the soft quadrotor. Funding: This work was supported in part by the National Science Foundation (NSF) CAREER program under Grant 2145277, ...

  69. [70]

    Figure 7B shows the architecture, which contains a dense input layer, a Gated Recurrent Unit (GRU) layer, and a dense output layer. The initialization of the recurrent state (value at step 0) is all zeros and we feed back the previous output of the policy as an input. Due to the small hidden dimensions S5 the foundation policy only has 2084 parameters: 𝑃=𝑃 in...
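The excerpt's parameter breakdown is cut off, so the following is only a sketch of how such a count can be computed for a dense-GRU-dense network. It assumes weights plus biases in both dense layers, a GRU with two bias vectors per gate (a convention some deep-learning libraries use), and four outputs. The hidden size of 16 is from the source. The input width of 23 is not stated in the visible text: it is simply the value for which these assumptions reproduce the reported total of 2084, and the paper's own breakdown may differ.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def gru_params(n_in, hidden, two_biases=True):
    """Three gates, each with input weights, recurrent weights, and bias(es)."""
    biases = (2 if two_biases else 1) * hidden
    return 3 * (n_in * hidden + hidden * hidden + biases)

def policy_params(input_dim, hidden=16, output_dim=4):
    """Total for dense input -> GRU -> dense output (ASSUMED layout)."""
    return (
        dense_params(input_dim, hidden)
        + gru_params(hidden, hidden)
        + dense_params(hidden, output_dim)
    )

# Under these assumptions: 384 (input) + 1632 (GRU) + 68 (output) = 2084.
total = policy_params(23)
```

Most of the budget sits in the GRU layer, which grows quadratically with the hidden size, so the hidden size of 16 is what keeps the policy this small.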