pith. machine review for the scientific record.

arXiv: 2604.19980 · v1 · submitted 2026-04-21 · 💻 cs.RO · cs.SY · eess.SY

Recognition: unknown

Efficient Reinforcement Learning using Linear Koopman Dynamics for Nonlinear Robotic Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 01:37 UTC · model grok-4.3

classification 💻 cs.RO · cs.SY · eess.SY
keywords reinforcement learning · Koopman operator · model-based RL · nonlinear robotics · actor-critic · policy optimization · linear dynamics · sample efficiency

The pith

Linear dynamics learned via the Koopman operator allow model-based reinforcement learning to optimize policies for nonlinear robots using only one-step predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out a framework that learns a linear representation of nonlinear robotic dynamics by means of the Koopman operator and inserts that model into an actor-critic loop. Policy gradients are computed from single-step predictions of the learned dynamics rather than multi-step rollouts, so that policy updates can be performed online from streamed interaction data. A reader would care because the method promises to raise sample efficiency above standard model-free reinforcement learning while still reaching performance levels that classical controllers achieve only when they are given exact nonlinear models. The claim is supported by tests on simulated benchmarks plus physical experiments with a Kinova robotic arm and a Unitree quadruped.

Core claim

By lifting the original nonlinear state transitions into a linear system through the Koopman operator, the framework embeds the resulting model inside an actor-critic architecture so that policy gradients can be estimated reliably from one-step predictions. This produces an online mini-batch algorithm that improves policies from interaction data without incurring the compounding errors of long-horizon rollouts, and yields sample efficiency superior to model-free baselines together with control performance comparable to methods that assume perfect knowledge of the system dynamics.
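One plausible instantiation of this claim, written in our notation rather than the paper's (φ a lifting dictionary, A, B, C the fitted linear operators, μ_θ the parameterized policy, V̂_w a learned critic, c a per-step cost):

```latex
\begin{aligned}
z_t &= \varphi(x_t), \qquad z_{t+1} \approx A z_t + B u_t, \qquad x_t \approx C z_t,\\
u_t &= \mu_\theta(x_t), \qquad \hat{x}_{t+1} = C\bigl(A\,\varphi(x_t) + B\,\mu_\theta(x_t)\bigr),\\
\nabla_\theta J &\approx \frac{1}{N}\sum_{t=1}^{N}\nabla_\theta\Bigl[c\bigl(x_t,\mu_\theta(x_t)\bigr) + \gamma\,\hat{V}_w\bigl(\hat{x}_{t+1}\bigr)\Bigr].
\end{aligned}
```

The model enters each gradient sample exactly once, so rollout error has no horizon over which to compound; that single application is the whole bet.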

What carries the argument

The Koopman operator that lifts nonlinear robotic dynamics into a linear form in a higher-dimensional space, allowing one-step forward predictions to supply stable gradients for policy optimization inside an actor-critic loop.
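To make that machinery concrete, here is a minimal sketch of a one-step actor update through a linear lifted model, assuming PyTorch. Every name here (lift, A, B, C, policy, critic, the quadratic cost) is a stand-in, not the paper's implementation; the fitted operators and the critic's own training are elided.

```python
import torch

n, m, p = 4, 1, 16  # state, control, and lifted-state dimensions (arbitrary)

# Placeholder linear lifted dynamics z' = A z + B u, decoded by x ~ C z.
# In a real pipeline these would be fitted from data (e.g., by EDMD), not random.
A, B, C = torch.randn(p, p) * 0.1, torch.randn(p, m) * 0.1, torch.randn(n, p) * 0.1

def lift(x):
    """Stand-in dictionary: state plus simple nonlinear features (p = 4n)."""
    return torch.cat([x, torch.sin(x), torch.cos(x), x**2], dim=-1)

policy = torch.nn.Sequential(torch.nn.Linear(n, 32), torch.nn.Tanh(),
                             torch.nn.Linear(32, m))
critic = torch.nn.Sequential(torch.nn.Linear(n, 32), torch.nn.Tanh(),
                             torch.nn.Linear(32, 1))  # assumed trained elsewhere
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def actor_step(x_batch):
    """One policy update from a streamed mini-batch: a single model application."""
    u = policy(x_batch)                     # differentiable action
    z_next = lift(x_batch) @ A.T + u @ B.T  # one-step lifted prediction
    x_next = z_next @ C.T                   # decode back to state space
    step_cost = (x_batch**2).sum(-1) + 0.1 * (u**2).sum(-1)  # toy quadratic cost
    loss = (step_cost + gamma * critic(x_next).squeeze(-1)).mean()
    opt.zero_grad(); loss.backward(); opt.step()

actor_step(torch.randn(64, n))
```

The only model call per sample is the single product `lift(x) @ A.T + u @ B.T`; nothing is propagated beyond one step, which is where the claimed immunity to compounding rollout error comes from.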

If this is right

  • Policy updates become feasible from streamed mini-batches without requiring simulated multi-step trajectories.
  • Control performance reaches levels previously available only to classical controllers supplied with exact nonlinear dynamics.
  • Sample efficiency exceeds that of typical model-free reinforcement learning on the same nonlinear benchmarks.
  • The same one-step gradient approach applies directly to real hardware platforms without hand-derived models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the linear lifting remains reliable, the same structure could be tried on other nonlinear systems such as vehicle dynamics or chemical processes where long rollouts are costly.
  • Better choices of lifting functions might further shrink model error and support modest extensions of the planning horizon while keeping the one-step update rule.
  • Scaling tests on higher-dimensional robots would show whether the efficiency gain persists when the lifted space grows larger.

Load-bearing premise

The learned linear model stays accurate enough across the states and actions seen during learning that one-step predictions produce consistent policy improvement rather than being overwhelmed by model error.
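A cheap way to watch this premise during training is a normalized one-step prediction error computed on the transitions the live policy actually generates. A sketch under assumed shapes and names, not drawn from the paper:

```python
import numpy as np

def one_step_nmse(model_predict, transitions):
    """Normalized MSE of one-step predictions over on-policy transitions.

    model_predict: callable (x, u) -> predicted next state (hypothetical).
    transitions: iterable of (x_t, u_t, x_next) tuples from the live policy.
    """
    err, scale = 0.0, 0.0
    for x, u, x_next in transitions:
        x_hat = model_predict(x, u)
        err += float(np.sum((x_hat - x_next) ** 2))
        scale += float(np.sum(x_next ** 2))
    return err / max(scale, 1e-12)
```

If this ratio drifts upward as the policy pushes into new regions of the state space, the one-step gradients are likely already biased by model error, and the premise above is failing.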

What would settle it

Running the learned controller on the physical arm or quadruped and finding that closed-loop performance falls below a model-free baseline or becomes unstable once the accumulated prediction error exceeds a small threshold.

Figures

Figures reproduced from arXiv: 2604.19980 by Shaoshuai Mou, Wenjian Hao, Yuxuan Fang, Zehui Lu.

Figure 1. PGDK-Online framework: an online data collection module, a DKO module for dynamics approximation, a critic module for cost-value approximation, and an actor module for policy optimization.
Figure 2. Learning curves of the learning-based methods.
Figure 3. Learning curves for the benchmark tasks, where a higher reward indicates better control performance.
Figure 4. Trajectories from PGDK and LQR.
Figure 5. Gazebo simulation of the Kinova Gen3 Lite for goal tracking.
Figure 6. Learning curves and final goal-tracking errors.
Figure 7. Experiments on the Kinova Gen3 Lite robotic arm, with the task performed from a fixed initial state.
Figure 8. Learning curves and final goal-tracking errors.
Figure 9. Learning curves; the solid line denotes the mean.
Figure 10. Experiments on the Unitree Go1 quadruped robot, with three distinct initial states and a common goal.
Original abstract

This paper presents a model-based reinforcement learning (RL) framework for optimal closed-loop control of nonlinear robotic systems. The proposed approach learns linear lifted dynamics through Koopman operator theory and integrates the resulting model into an actor-critic architecture for policy optimization, where the policy represents a parameterized closed-loop controller. To reduce computational cost and mitigate model rollout errors, policy gradients are estimated using one-step predictions of the learned dynamics rather than multi-step propagation. This leads to an online mini-batch policy gradient framework that enables policy improvement from streamed interaction data. The proposed framework is evaluated on several simulated nonlinear control benchmarks and two real-world hardware platforms, including a Kinova Gen3 robotic arm and a Unitree Go1 quadruped. Experimental results demonstrate improved sample efficiency over model-free RL baselines, superior control performance relative to model-based RL baselines, and control performance comparable to classical model-based methods that rely on exact system dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a model-based RL framework for nonlinear robotic systems that learns linear lifted dynamics via Koopman operator theory and embeds one-step predictions from this model into an actor-critic policy optimization loop. Policy gradients are computed from single-step model predictions rather than multi-step rollouts to reduce computation and error accumulation. The approach is evaluated on simulated nonlinear control tasks plus real hardware (Kinova Gen3 arm and Unitree Go1 quadruped), with claims of improved sample efficiency versus model-free baselines, better performance than other model-based RL methods, and parity with classical controllers that use exact dynamics.

Significance. If the central claims hold after addressing validation gaps, the work would demonstrate a practical route to sample-efficient model-based RL for high-dimensional nonlinear robots by exploiting Koopman lifting to obtain linear dynamics suitable for one-step gradient estimation. This could be valuable in robotics where exact models are unavailable and multi-step model rollouts are unreliable, provided the one-step approximation remains faithful on-policy.

major comments (2)
  1. [Method (one-step policy gradient estimation) and Experiments] The core assumption that one-step predictions from the learned Koopman model suffice for unbiased policy gradients (without model error dominating) is load-bearing for all performance claims, yet no quantitative one-step prediction error bounds, on-policy error analysis, or ablation isolating model mismatch effects on gradient direction are provided in the method or experiments sections.
  2. [Experiments] Experimental results claim superior sample efficiency and control performance, but the manuscript provides no details on the model-fitting procedure, hyperparameter selection, statistical significance tests, or data exclusion criteria, which prevents confirmation that the favorable comparisons reported in the abstract are robust.
minor comments (2)
  1. [Method] Clarify the precise definition of the lifted state space and the Koopman operator approximation method (e.g., which dictionary functions or EDMD variant is used) to aid reproducibility; a generic EDMD-with-control sketch follows this list.
  2. [Experiments] Figure captions for hardware experiments should explicitly state the number of trials, seeds, and whether results are mean ± std.
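For context on that first minor point: the standard EDMD-with-control fit is a single least-squares solve. The sketch below shows the generic recipe only; the dictionary `phi`, the shapes, and the names are placeholders, not the paper's actual choices.

```python
import numpy as np

def edmd_with_control(X, U, X_next, phi):
    """Fit lifted linear dynamics z' ~ A z + B u by least squares (generic
    EDMD-with-control recipe; not necessarily the paper's variant).

    X, X_next: (T, n) arrays of consecutive states; U: (T, m) controls.
    phi: dictionary mapping a (T, n) state array to (T, p) lifted features.
    Returns A with shape (p, p) and B with shape (p, m).
    """
    Z = phi(X)                                      # (T, p) lifted current states
    Z_next = phi(X_next)                            # (T, p) lifted next states
    G = np.hstack([Z, U])                           # (T, p + m) stacked regressors
    K, *_ = np.linalg.lstsq(G, Z_next, rcond=None)  # solves min ||G K - Z_next||
    p = Z.shape[1]
    return K[:p].T, K[p:].T                         # A = K[:p].T, B = K[p:].T
```

A dictionary as simple as `phi = lambda X: np.hstack([X, np.sin(X), X**2])` already makes this runnable; the design question the referee raises is precisely which `phi` the authors used.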

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the validation of our claims.

Point-by-point responses
  1. Referee: [Method (one-step policy gradient estimation) and Experiments] The core assumption that one-step predictions from the learned Koopman model suffice for unbiased policy gradients (without model error dominating) is load-bearing for all performance claims, yet no quantitative one-step prediction error bounds, on-policy error analysis, or ablation isolating model mismatch effects on gradient direction are provided in the method or experiments sections.

    Authors: We acknowledge that the one-step gradient estimation is central to the approach and that additional analysis is required to substantiate it. The original manuscript motivates one-step predictions primarily through reduced error accumulation and computational cost, with empirical support from hardware and simulation results. In revision, we will add: (i) quantitative one-step prediction error bounds (e.g., normalized MSE on held-out and on-policy data), (ii) on-policy error analysis comparing predicted vs. observed trajectories during policy optimization, and (iii) an ablation isolating model mismatch by contrasting gradients from the learned model against those from ground-truth dynamics on simulated tasks. These will appear in the revised Method and Experiments sections. revision: yes

  2. Referee: [Experiments] Experimental results claim superior sample efficiency and control performance, but the manuscript provides no details on the model-fitting procedure, hyperparameter selection, statistical significance tests, or data exclusion criteria, which prevents confirmation that the favorable comparisons reported in the abstract are robust.

    Authors: We agree that these experimental details are necessary for reproducibility and to confirm robustness. The revised manuscript will expand the Experiments section with: a full description of the Koopman model fitting procedure (data collection, loss, optimizer); hyperparameter tables with selection methodology; performance metrics reported as mean ± std over multiple random seeds together with statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests); and an explicit statement that no data were excluded. These additions will directly support the abstract claims. revision: yes
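For reference, the paired per-seed comparison the authors promise is a few lines with scipy's standard API; the return values below are placeholders, not results from the paper.

```python
import numpy as np
from scipy import stats

# Hypothetical final returns per random seed for two methods (placeholder data).
pgdk = np.array([910.2, 905.8, 921.4, 899.7, 915.0])
sac  = np.array([884.1, 890.3, 876.5, 902.2, 879.8])

# Paired t-test and Wilcoxon signed-rank test on per-seed differences.
t_stat, t_p = stats.ttest_rel(pgdk, sac)
w_stat, w_p = stats.wilcoxon(pgdk - sac)
print(f"paired t-test p={t_p:.3f}, Wilcoxon signed-rank p={w_p:.3f}")
```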

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper learns a Koopman operator-based linear lifted model from interaction data and integrates one-step predictions of this model into an actor-critic policy gradient update. The reported improvements in sample efficiency and control performance are measured via direct experiments on simulated benchmarks and physical hardware (Kinova Gen3 arm and Unitree Go1 quadruped), using actual closed-loop trajectories rather than model-predicted quantities. No equation or claim reduces the final performance metrics to the fitted Koopman parameters by construction, and the method description contains no load-bearing self-citations, imported uniqueness theorems, or ansatzes that collapse the central result to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the central claim rests on the empirical performance of the learned Koopman model and the one-step gradient estimator.

pith-pipeline@v0.9.0 · 5463 in / 1155 out tokens · 45878 ms · 2026-05-10T01:37:56.723766+00:00 · methodology

