Adaptive Outer-Loop Control of Quadrotors via Reinforcement Learning

Dileep Kalathil; Moble Benedict; Sushil Vemuri; Vishnu Saj

arxiv: 2605.16015 · v2 · pith:PQT4RJ2Onew · submitted 2026-05-15 · 💻 cs.RO · cs.LG

Pith reviewed 2026-05-20 18:26 UTC · model grok-4.3

keywords: quadrotor control, reinforcement learning, adaptive control, sim-to-real transfer, trajectory tracking, disturbance estimation, residual dynamics, slung load

The pith

Replacing ground-truth disturbance inputs with a Residual Dynamics Predictor lets a reinforcement learning outer-loop policy maintain precise quadrotor trajectory tracking under real-world mass changes, asymmetric payloads, and dynamic slung loads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Quadrotors encounter unpredictable external forces from shifting mass, uneven loads, or swinging payloads that cause standard controllers to lose accuracy. The paper develops an adaptive outer-loop architecture that first trains an optimal policy in simulation and then substitutes ground-truth disturbance information with a learned Residual Dynamics Predictor. This predictor estimates the instantaneous external forces and moments acting on the vehicle using only the recent history of its states and control inputs. A short linear calibration step and online thrust correction align the simulation model to the physical aircraft with seconds of flight data. If the approach holds, small drones could execute reliable trajectories without extra sensors or overly cautious policies that sacrifice performance.
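The linear calibration step can be pictured with a small sketch. The summary does not say which quantities the paper calibrates or in which direction the map runs, so the per-channel affine form, the direction (hardware value to simulation-scale value), and the names `fit_affine` and `apply_affine` are all assumptions made for illustration. The numbers are synthetic placeholders, not flight data.

```python
def fit_affine(hardware_values, sim_values):
    """Least-squares fit of sim ≈ scale * hardware + offset for one channel.

    Assumed form of a 'linear calibration bridge'; the paper's actual
    formulation is not given in this summary.
    """
    n = len(hardware_values)
    if n < 2 or n != len(sim_values):
        raise ValueError("need at least two paired samples of equal length")
    mean_h = sum(hardware_values) / n
    mean_s = sum(sim_values) / n
    var_h = sum((h - mean_h) ** 2 for h in hardware_values)
    if var_h == 0.0:
        raise ValueError("hardware_values must not be constant")
    cov = sum((h - mean_h) * (s - mean_s)
              for h, s in zip(hardware_values, sim_values))
    scale = cov / var_h
    offset = mean_s - scale * mean_h
    return scale, offset


def apply_affine(value, scale, offset):
    """Map one hardware-side value onto the simulation-side scale."""
    return scale * value + offset


# Synthetic placeholder pairs that follow sim = 2 * hardware + 0.5 exactly.
hardware = [0.0, 1.0, 2.0, 3.0]
sim = [0.5, 2.5, 4.5, 6.5]
scale, offset = fit_affine(hardware, sim)
```

Because the fit has only two parameters per channel, a few seconds of samples at a typical control rate would already overdetermine it, which is consistent with the "seconds of flight data" claim but does not confirm how the paper implements the step.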

Core claim

The paper establishes that an outer-loop reinforcement learning policy, augmented by a Residual Dynamics Predictor that infers external forces and moments online from state-action history alone, combined with a data-efficient calibration bridge and thrust correction, transfers successfully to hardware and outperforms baseline controllers in maintaining precise trajectory tracking on a Crazyflie quadrotor under mass variations, asymmetric payloads, and dynamic slung loads.

What carries the argument

The Residual Dynamics Predictor, which estimates instantaneous external forces and moments from the recent history of states and control actions without direct sensing.
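To make "external force" concrete, the sketch below computes a vertical residual: the part of the measured acceleration that a nominal rigid-body model does not explain. This is one plausible definition of the quantity the predictor is asked to estimate, not the paper's stated one. It assumes a level attitude (thrust acts along the world z axis, z pointing up), and the nominal mass is a placeholder value. The 4.7 g payload matches the figure quoted in the Figure 9 caption.

```python
G = 9.81  # gravitational acceleration in m/s^2


def vertical_residual_force(nominal_mass, accel_z, thrust_z):
    """Force left unexplained by the nominal model, z axis pointing up.

    Nominal model: nominal_mass * accel_z = thrust_z - nominal_mass * G.
    Whatever is missing from that balance is attributed to the residual.
    """
    modelled_force = thrust_z - nominal_mass * G
    return nominal_mass * accel_z - modelled_force


def implied_added_mass(residual_z):
    """Read a steady downward residual force as unmodelled payload mass."""
    return -residual_z / G


nominal_mass = 0.030   # kg, placeholder value (assumption)
payload_mass = 0.0047  # kg, the 4.7 g payload from the Figure 9 caption

# At hover the true vehicle needs thrust equal to its full weight.
hover_thrust = (nominal_mass + payload_mass) * G
residual = vertical_residual_force(nominal_mass, 0.0, hover_thrust)
added_mass = implied_added_mass(residual)
```

The mass reading is only valid near steady flight; while accelerating, the residual also contains the payload's inertial term.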

If this is right

The same outer-loop policy plus predictor structure can handle multiple classes of uncertainty without retraining the core policy.
Hardware transfer requires only seconds of flight data rather than extensive fine-tuning or additional instrumentation.
Trajectory tracking remains precise even when payloads are asymmetric or change dynamically during flight.
The approach avoids the conservatism that arises when domain randomization alone is used to prepare for unknown disturbances.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar predictor-based adaptation could be applied to other rotorcraft or fixed-wing vehicles facing comparable disturbance regimes.
The online force estimates might be logged to detect gradual changes in vehicle dynamics that signal the need for maintenance.
Extending the calibration bridge to include environmental factors such as wind could further improve outdoor performance.
The method might reduce the sensor payload required for robust autonomous flight in uncertain conditions.

Load-bearing premise

The Residual Dynamics Predictor can accurately estimate the current external forces and moments acting on the quadrotor using only past states and control inputs without additional sensors or hardware.

What would settle it

Flight tests in which a dynamic slung load is introduced and the Residual Dynamics Predictor produces force estimates that lead to trajectory tracking errors exceeding those of standard domain-randomization baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.16015 by Dileep Kalathil, Moble Benedict, Sushil Vemuri, Vishnu Saj.

**Figure 2.** Schematic of the quadrotor body frame and …

**Figure 3.** The Bitcraze Crazyflie 2.X micro-quadrotor

**Figure 4.** Block diagram of the proposed cascaded control

**Figure 5.** Real-time estimation of the added payload mass

**Figure 6.** Comparison of the additional mass added pre…

**Figure 7.** Real-time estimation of the induced roll moment

**Figure 8.** Real-time estimation of the induced pitch mo…

**Figure 9.** Planar x–y trajectory tracking performance of the adaptive controller carrying a 4.7 g suspended payload attached via a thread of length equal to the arm length. The reference trajectory is a Lissajous figure-8 curve with decreasing time periods (T). As T decreases from 15 s to 3 s, the required velocities and accelerations increase significantly, inducing aggressive pendulum dynamics. Remarkably, the con…

**Figure 10.** Time-series of the estimated disturbance quantities: vertical force (…
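The Figure 9 caption says that shrinking the period T from 15 s to 3 s sharply raises the required accelerations. A short sketch shows why: for a sinusoidal figure-8, acceleration scales with 1/T². The parametrization below (x at the base frequency, y at twice that frequency) is a common Lissajous figure-8, and the amplitudes are placeholder assumptions; the paper's exact reference curve is not given here.

```python
import math


def figure8_reference(t, period, amp_x=1.0, amp_y=0.5):
    """Planar Lissajous figure-8: y oscillates at twice the x frequency."""
    w = 2.0 * math.pi / period
    x = amp_x * math.sin(w * t)
    y = amp_y * math.sin(2.0 * w * t)
    return x, y


def peak_acceleration_bound(period, amp_x=1.0, amp_y=0.5):
    """Upper bound on planar acceleration magnitude.

    Per-axis peaks are amp_x * w^2 and amp_y * (2w)^2; combining them
    with hypot bounds the magnitude even if the peaks do not coincide.
    """
    w = 2.0 * math.pi / period
    return math.hypot(amp_x * w ** 2, amp_y * (2.0 * w) ** 2)


# Ratio of acceleration demand between the fastest and slowest periods.
ratio = peak_acceleration_bound(3.0) / peak_acceleration_bound(15.0)
```

With these assumptions the ratio is (15/3)² = 25 regardless of the amplitudes, so the fastest lap demands roughly 25 times the acceleration of the slowest.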

Original abstract

Deep Reinforcement Learning (DRL) for quadrotor flight control typically relies on Domain Randomization (DR) for sim-to-real transfer, resulting in overly conservative policies that struggle with dynamic disturbances. To overcome this, we propose a novel adaptive control architecture that actively perceives and reacts to instantaneous perturbations. First, we train an optimal outer-loop policy, then replace its reliance on ground-truth disturbance data with a Residual Dynamics Predictor (RDP). The RDP estimates the external forces and moments acting on the aircraft in flight online using only the history of states and control actions. For seamless hardware transfer, we introduce a data-efficient linear calibration bridge and an online thrust correction mechanism that align the simulated latent space with reality using mere seconds of flight data. Real-world validations on a Crazyflie micro-quadrotor demonstrate that our adaptive controller significantly outperforms baselines, maintaining precise trajectory tracking under severe uncertainties including mass variations, asymmetric payloads, and dynamic slung loads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper combines an RL outer-loop policy with an online residual predictor and quick linear calibration to adapt quadrotor control to real disturbances, and the Crazyflie tests show it beats baselines under mass changes and slung loads.

The main takeaway is that this work replaces ground-truth disturbance inputs in an RL quadrotor policy with a Residual Dynamics Predictor that pulls estimates from state-action history, then adds a data-efficient linear bridge and thrust correction to move from sim to a Crazyflie. The hardware trials indicate better trajectory tracking than non-adaptive baselines when mass varies, payloads are asymmetric, or loads swing dynamically. That setup directly targets the conservatism that comes from heavy domain randomization in drone RL. The calibration step using seconds of flight data is a practical detail that lowers the barrier to hardware use. The real-world results give concrete evidence that the architecture can respond to uncertainties without extra sensors.

The soft spots sit in the validation. The abstract reports outperformance but gives no numbers, error bars, or training specifics for the RDP, so the size of the gains and the predictor's accuracy remain hard to judge from the summary alone. The concern about instantaneous estimates for fast slung-load disturbances is reasonable; recovering coupled time-varying forces from recent history alone could introduce lag or noise sensitivity on a micro-quadrotor, and without timing or ablation data it is difficult to confirm the bandwidth is sufficient.

This paper is for people working on sim-to-real RL for small aerial robots who want an adaptive outer loop rather than more randomization. A reader focused on practical control architectures would pick up usable ideas from the integration and calibration. It deserves a serious referee because the hardware tests address a genuine limitation in current methods and the claims are falsifiable on the same platform.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an adaptive outer-loop control architecture for quadrotors that trains an optimal RL policy in simulation and replaces ground-truth disturbance inputs with a Residual Dynamics Predictor (RDP). The RDP estimates instantaneous external forces and moments online from state-action history alone. A linear calibration bridge and online thrust correction enable sim-to-real transfer with seconds of flight data. Real-world Crazyflie experiments claim superior trajectory tracking versus baselines under mass variation, asymmetric payloads, and dynamic slung loads.

Significance. If validated, the approach offers a practical route to reactive adaptation in RL-based quadrotor control without extra sensors, addressing limitations of domain randomization for dynamic disturbances. The data-efficient calibration and real-world slung-load results would be useful contributions to aerial robotics if the RDP estimation bandwidth and accuracy are rigorously demonstrated.

major comments (2)

§4.2 (RDP definition and training): The central claim that the RDP recovers instantaneous external forces/moments for fast-varying disturbances (e.g., dynamic slung loads) from state-action history alone is load-bearing. No quantitative results on estimation latency, bandwidth, or error during slung-load oscillation are provided; without these, it is unclear whether the predictor can invert the coupled residual dynamics at the required rate or whether lag undermines the reported adaptation advantage.
Results section, performance tables: The outperformance under asymmetric payloads and slung loads is asserted, yet the tables report only mean errors without standard deviations, trial counts, or statistical tests. This prevents assessment of whether the gains are robust or could be explained by trial-to-trial variability.
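The latency question in the first major comment can be made measurable with a standard cross-correlation check: shift the predicted residual against a reference signal and keep the shift with the highest correlation. The sketch below is one simple way to do that, not a procedure taken from the paper. The function names, the 100 Hz sample rate, the 2 Hz test tone, and the 5-sample delay are all synthetic assumptions.

```python
import math


def _correlation(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if flat)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    den_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if den_a == 0.0 or den_b == 0.0:
        return 0.0
    return num / (den_a * den_b)


def estimate_lag(reference, estimate, max_lag):
    """Delay of `estimate` behind `reference`, in samples (0..max_lag)."""
    best_lag, best_score = 0, -2.0
    for lag in range(max_lag + 1):
        n = len(reference) - lag
        # Pair reference[i] with estimate[i + lag].
        score = _correlation(reference[:n], estimate[lag:lag + n])
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


SAMPLE_RATE_HZ = 100.0  # assumed logging rate
TONE_HZ = 2.0           # assumed oscillation frequency of the disturbance


def tone(i):
    return math.sin(2.0 * math.pi * TONE_HZ * i / SAMPLE_RATE_HZ)


reference = [tone(i) for i in range(200)]
estimate = [tone(i - 5) for i in range(200)]  # same signal, 5 samples late
lag = estimate_lag(reference, estimate, max_lag=20)
latency_s = lag / SAMPLE_RATE_HZ
```

The search range must stay below one oscillation period (50 samples here), otherwise a periodic signal matches at several shifts and the lag becomes ambiguous.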

minor comments (2)

Abstract: The phrase 'significantly outperforms baselines' is used without any numerical values or error reductions; inserting one or two key quantitative results would make the claim concrete.
Notation: The symbol for estimated disturbance in Eq. (5) is easily confused with the policy output; a distinct symbol or explicit reminder in the text would reduce reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the two major comments point by point below, agreeing where the manuscript can be strengthened and outlining specific revisions.

Referee: §4.2 (RDP definition and training): The central claim that the RDP recovers instantaneous external forces/moments for fast-varying disturbances (e.g., dynamic slung loads) from state-action history alone is load-bearing. No quantitative results on estimation latency, bandwidth, or error during slung-load oscillation are provided; without these, it is unclear whether the predictor can invert the coupled residual dynamics at the required rate or whether lag undermines the reported adaptation advantage.

Authors: We agree that explicit quantitative characterization of the RDP is necessary to substantiate its suitability for fast-varying disturbances. The manuscript currently supports the claim indirectly via end-to-end closed-loop tracking performance under dynamic slung loads, but does not report per-timestep estimation error, latency, or bandwidth during oscillation. In the revision we will add these metrics to §4.2 (or a new appendix), including time-series comparisons of predicted versus measured residual forces/moments, a frequency-domain bandwidth estimate, and measured latency relative to the control loop rate. This addition will directly address whether lag is negligible at the operating frequency.

revision: yes

Referee: Results section, performance tables: The outperformance under asymmetric payloads and slung loads is asserted, yet the tables report only mean errors without standard deviations, trial counts, or statistical tests. This prevents assessment of whether the gains are robust or could be explained by trial-to-trial variability.

Authors: The observation is correct; the present tables contain only mean errors. We will revise the results section to report standard deviations, the number of independent trials per condition (ten flights), and the outcomes of paired statistical tests (e.g., t-tests with p-values) comparing our controller against each baseline under asymmetric payload and slung-load conditions. These changes will allow readers to evaluate the statistical robustness of the reported improvements.

revision: yes
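For readers who want to see what the promised paired test involves, here is a minimal sketch of a paired t statistic using only the Python standard library. The error lists are made-up placeholders with no units, not results from the paper, and `paired_t_statistic` is a hypothetical helper name. A p-value would additionally need the Student t distribution, which the standard library does not provide.

```python
import math
import statistics


def paired_t_statistic(errors_a, errors_b):
    """Paired t statistic for per-trial errors of two controllers.

    Returns (mean difference, sample std of differences, t, degrees of
    freedom). Trials must be matched one-to-one across the two lists.
    """
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two matched lists with at least two trials")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0.0:
        raise ValueError("differences are constant; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return mean_diff, sd_diff, t_stat, n - 1


# Placeholder numbers for illustration only (NOT measured data).
baseline_errors = [5.0, 6.0, 7.0, 8.0]
adaptive_errors = [4.0, 4.0, 6.0, 6.0]
mean_diff, sd_diff, t_stat, dof = paired_t_statistic(
    baseline_errors, adaptive_errors
)
```

Pairing matters here because each flight condition (same payload, same trajectory) is flown with both controllers, so the test works on per-condition differences rather than on two independent samples.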

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent training and real-world validation

The paper trains an outer-loop policy using ground-truth disturbances in simulation, then substitutes a separately trained Residual Dynamics Predictor (RDP) that maps state-action history to residual forces/moments. Real-world Crazyflie experiments under mass variation, payloads, and slung loads serve as external validation rather than a closed loop that reduces predictions to fitted inputs by construction. No self-definitional equations, load-bearing self-citations, or uniqueness theorems imported from prior author work are present in the abstract or described architecture. The RDP training and online calibration steps are presented as standard supervised learning followed by transfer, not as tautological renaming or forced equivalence.
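The phrase "maps state-action history to residual forces/moments" implies that some fixed-length window of recent states and actions is fed to the predictor at every step. The sketch below shows one way to maintain such a window. The class name `HistoryBuffer`, the zero padding before the buffer fills, and the dimensions are assumptions for illustration; the paper's horizon and input layout are not stated in this summary.

```python
from collections import deque


class HistoryBuffer:
    """Fixed-length window of concatenated (state, action) rows."""

    def __init__(self, horizon, state_dim, action_dim):
        self.state_dim = state_dim
        self.action_dim = action_dim
        width = state_dim + action_dim
        # Start zero-padded so the window always has `horizon` rows.
        self._rows = deque(
            ([0.0] * width for _ in range(horizon)), maxlen=horizon
        )

    def push(self, state, action):
        """Append the newest sample; the oldest row drops out."""
        if len(state) != self.state_dim or len(action) != self.action_dim:
            raise ValueError("state/action size does not match the buffer")
        self._rows.append(list(state) + list(action))

    def window(self):
        """Rows ordered oldest to newest, ready to feed a sequence model."""
        return [row[:] for row in self._rows]


# Hypothetical sizes: 2 state values, 1 action value, 3-step horizon.
buf = HistoryBuffer(horizon=3, state_dim=2, action_dim=1)
for k in range(4):
    buf.push([float(k), float(k) + 0.5], [0.1 * k])
window = buf.window()
```

After four pushes into a three-step buffer, the window holds only the samples for k = 1, 2, 3, which is the sliding behavior a predictor that runs online would rely on.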

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the RDP is presented as a learned module whose internal assumptions are not stated.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "The RDP estimates the external forces and moments acting on the aircraft in flight online using only the history of states and control actions."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We formulate this system identification problem as a sequence-to-vector regression task... GRU layers... outputs the predicted 6D perturbation vector"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 6 internal anchors

[1] L. Bauersfeld, E. Kaufmann, P. Foehn, S. Sun, and D. Scaramuzza, “Neurobem: Hybrid aerodynamic quadrotor model,” arXiv preprint arXiv:2106.08015, 2021.

[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.

[3] K. Wang, I. Javali, M. Bortkiewicz, B. Eysenbach, et al., “1000 layer networks for self-supervised RL: Scaling depth can enable new goal-reaching capabilities,” arXiv preprint arXiv:2503.14858, 2025.

[4] J. Hwangbo, I. Sa, R. Siegwart, and M. Hutter, “Control of a quadrotor with reinforcement learning,” IEEE Robotics and Automation Letters, vol. 2, no. 4, pp. 2096–2103, 2017.

[5] J. Yoo, D. Jang, H. J. Kim, and K. H. Johansson, “Hybrid reinforcement learning control for a micro quadrotor flight,” IEEE Control Systems Letters, vol. 5, no. 2, pp. 505–510, 2020.

[6] S. Batra, Z. Huang, A. Petrenko, T. Kumar, A. Molchanov, and G. S. Sukhatme, “Decentralized control of quadrotor swarms with end-to-end deep reinforcement learning,” in Conference on Robot Learning, pp. 576–586, PMLR, 2022.

[7] A. Kumar, Z. Fu, D. Pathak, and J. Malik, “RMA: Rapid motor adaptation for legged robots,” arXiv preprint arXiv:2107.04034, 2021.

[8] D. Zhang, A. Loquercio, J. Tang, T.-H. Wang, J. Malik, and M. W. Mueller, “A learning-based quadcopter controller with extreme adaptation,” IEEE Transactions on Robotics, 2025.

[9] J. Eschmann, D. Albani, and G. Loianno, “RAPTOR: A foundation policy for quadrotor control,” arXiv preprint arXiv:2509.11481, 2025.

[10] M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Muñoz, X. Yao, R. Zurbrügg, N. Rudin, L. Wawrzyniak, M. Rakhsha, A. Denzler, E. Heiden, A. Borovicka, O. Ahmed, I. Akinola, A. Anwar, M. T. Carlson, J. Y. Feng, A. Garg, R. Gasoto, L. Gulich, Y. Guo, M. Gussert, A. Hansen, M. Kulkarni, C. Li, W. Liu, V. Makoviychuk, G. Malczyk, ..., “Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning,” arXiv, 2025.

[11] Bitcraze, “Crazyflie 2.1 nano quadcopter.” https://www.bitcraze.io/products/old-products/crazyflie-2-1/, 2024. Accessed: 2026-04-15.

[12] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017.

[13] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
