
arxiv: 2509.11002 · v2 · pith:Z66KYHHPnew · submitted 2025-09-13 · ⚛️ physics.flu-dyn

Real-time reinforcement learning for turbulent state-dependent control in a bluff-body wake

Pith reviewed 2026-05-21 22:38 UTC · model grok-4.3

classification ⚛️ physics.flu-dyn
keywords reinforcement learning · turbulent flow control · bluff body wake · drag reduction · real-time control · aerodynamics · coherent structures · wind tunnel experiment

The pith

A reinforcement learning agent learns real-time state-dependent control of a turbulent bluff-body wake from sparse onboard sensors alone and reduces drag with net energy savings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces REACT, an autonomous reinforcement learning framework that learns directly from experimental measurements in a wind-tunnel setup on an Ahmed-body model. The agent discovers a policy that dynamically suppresses coherent flow structures in the wake, delivering greater drag reduction and energy savings than model-based baselines. Training occurs in a nondimensional state-reward space with Reynolds-number conditioning so that one offline policy remains effective across the tested range without retraining. The work contrasts this dynamics-aware approach with quasi-steady policies that achieve less suppression of instabilities. The results show closed-loop learning is possible in high-Reynolds-number turbulent flows using only onboard data.

Core claim

The REACT agent autonomously converges to a policy that reduces aerodynamic drag while achieving net energy savings. It does so by dynamically suppressing spatiotemporally coherent flow structures in the bluff-body wake, and it achieves two to four times greater performance than model-based baseline controllers. By training in a nondimensional state-reward space and conditioning on Reynolds number for temporal adaptation, it learns a single offline policy that remains effective across Reynolds numbers 86,400 to 518,400 without retraining.
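One way to make "net energy savings" concrete is a simple power budget: the aerodynamic power recovered by the drag reduction minus the power spent on actuation. The sketch below assumes that definition, which this summary does not confirm as the paper's own; the function name and all numbers are illustrative placeholders, not measurements.

```python
def net_power_saving(drag_baseline_n, drag_controlled_n, speed_m_s, actuation_power_w):
    """Aerodynamic power saved minus actuation power, in watts (assumed definition)."""
    saved_w = (drag_baseline_n - drag_controlled_n) * speed_m_s
    return saved_w - actuation_power_w

# Placeholder values chosen for illustration only.
net = net_power_saving(drag_baseline_n=10.0, drag_controlled_n=9.0,
                       speed_m_s=20.0, actuation_power_w=5.0)
print(net)  # a positive value means the control more than pays for its own actuation
```

Under this accounting a controller can reduce drag and still lose energy overall if the actuation cost exceeds the recovered aerodynamic power.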

What carries the argument

The REACT reinforcement learning agent trained directly from sparse onboard sensor measurements in a nondimensional state-reward space with Reynolds-number conditioning.
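A minimal sketch of what a nondimensional, Reynolds-conditioned observation could look like, assuming pressure readings are scaled by the dynamic pressure and the Reynolds number enters as a log-scaled feature. The fluid constants, the function names, and the log scaling are assumptions made for illustration; the summary does not state how the paper builds its state vector.

```python
import math

RHO = 1.2        # air density in kg/m^3 (assumed)
NU = 1.5e-5      # kinematic viscosity in m^2/s (assumed)
RE_MIN, RE_MAX = 86_400, 518_400   # tested range quoted in the summary

def reynolds(speed, length):
    return speed * length / NU

def pressure_coefficients(pressures_pa, p_ref_pa, speed):
    """Scale gauge pressures by the dynamic pressure 0.5 * rho * U^2."""
    q = 0.5 * RHO * speed ** 2
    return [(p - p_ref_pa) / q for p in pressures_pa]

def re_feature(re):
    """Map Re to [0, 1] on a log scale (hypothetical conditioning input)."""
    return (math.log(re) - math.log(RE_MIN)) / (math.log(RE_MAX) - math.log(RE_MIN))

def build_state(pressures_pa, p_ref_pa, speed, length):
    cps = pressure_coefficients(pressures_pa, p_ref_pa, speed)
    return cps + [re_feature(reynolds(speed, length))]

# Pressures that scale with U^2 give the same coefficients at both speeds;
# only the Reynolds feature changes.
low = build_state([-30.0, -60.0], 0.0, speed=10.0, length=0.2)
high = build_state([-120.0, -240.0], 0.0, speed=20.0, length=0.2)
print(low, high)
```

The point of the scaling is that the policy sees inputs of similar amplitude at every speed, while the extra feature tells it which time scales to expect.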

If this is right

  • The policy suppresses spatiotemporally coherent instabilities rather than adjusting only the mean flow.
  • Net energy savings accompany the drag reduction because the control avoids unnecessary actuation.
  • A single policy generalizes across a factor-of-six range in Reynolds number without retraining.
  • State-dependent, dynamics-aware control outperforms representative quasi-steady baselines in this turbulent regime.
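The distinction in the first bullet, suppressing coherent instabilities versus shifting only the mean flow, can be illustrated by comparing the power at a dominant frequency before and after control. The sketch below uses synthetic sinusoidal signals and a single-bin discrete Fourier transform; the signals, frequencies, and amplitudes are invented for illustration and are not taken from the paper.

```python
import cmath
import math

def power_at(signal, freq_hz, sample_rate_hz):
    """Squared amplitude of the fluctuation at freq_hz (single-bin DFT, mean removed)."""
    n = len(signal)
    mean = sum(signal) / n
    acc = sum((x - mean) * cmath.exp(-2j * math.pi * freq_hz * k / sample_rate_hz)
              for k, x in enumerate(signal))
    return (2 * abs(acc) / n) ** 2

def synthetic_wake(mean, amplitude, freq_hz, sample_rate_hz, n):
    return [mean + amplitude * math.sin(2 * math.pi * freq_hz * k / sample_rate_hz)
            for k in range(n)]

FS, F0, N = 1000.0, 10.0, 1000   # placeholder sampling rate, frequency, length

uncontrolled = synthetic_wake(-0.20, 0.05, F0, FS, N)
mean_shift_only = synthetic_wake(-0.15, 0.05, F0, FS, N)       # mean changes, oscillation stays
structure_suppressed = synthetic_wake(-0.15, 0.01, F0, FS, N)  # oscillation is attenuated

base = power_at(uncontrolled, F0, FS)
print(power_at(mean_shift_only, F0, FS) / base)       # close to 1: no suppression
print(power_at(structure_suppressed, F0, FS) / base)  # well below 1: suppression
```

Both controlled signals have the same mean, so a mean-only metric cannot tell them apart; the spectral ratio can.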

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sensor-only learning approach could be tested on other separated flows such as airfoils or vehicles at scale.
  • If the nondimensional formulation holds at even higher Reynolds numbers, model-free control might extend to industrial turbulent systems.
  • Similar agents could be examined for multi-objective goals such as simultaneous drag and noise reduction.

Load-bearing premise

Sparse onboard sensor measurements alone contain sufficient information for the reinforcement learning agent to discover and stably execute a high-performance state-dependent control policy in a real high-Reynolds-number turbulent environment without any turbulence model or prior flow physics knowledge.

What would settle it

Deploy the learned policy at a Reynolds number well above 518,400 or with substantially fewer sensors and measure whether drag reduction and net energy savings collapse.

Original abstract

Controlling turbulent dynamics remains a major challenge because of its chaotic, multi-scale dynamics, which strongly influence the performance of many fluid systems. Here we report REACT (Reinforcement Learning for Environmental Adaptation and Control of Turbulence), an autonomous reinforcement learning framework for real-time state-dependent control of turbulent wake dynamics in a real wind-tunnel environment. Deployed on an Ahmed-body model equipped solely with onboard sensors and servo-actuated surfaces, REACT learns directly from sparse experimental measurements in a wind-tunnel environment, bypassing empirical turbulence models. The agent autonomously converges to a policy that reduces aerodynamic drag while achieving net energy savings. Without prior knowledge of flow physics, it discovers that dynamically suppressing spatiotemporally coherent flow structures in the bluff-body wake maximizes energy efficiency, achieving two to four times greater performance than model-based baseline controllers. We contrast the state-dependent, dynamics-aware policy of REACT with representative quasi-steady, mean-flow-oriented policies learned by standard reinforcement learning baselines, which deliver lower drag reduction and no direct suppression of coherent instabilities in this turbulent-wake regime. Finally, by training in a nondimensional state-reward space whose amplitudes are approximately Reynolds-number-invariant, and by conditioning on Reynolds number for temporal adaptation, REACT learns a single offline policy that remains effective across the tested Reynolds-number range 86,400 to 518,400, without retraining. These results demonstrate autonomous closed-loop reinforcement learning control in a high-Reynolds-number wind-tunnel environment and suggest a path toward data-driven state-dependent control of turbulent flows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents REACT, an autonomous reinforcement learning framework for real-time state-dependent control of turbulent wake dynamics behind a bluff body in a wind-tunnel experiment. Using only onboard sensors and actuators on an Ahmed-body model, the agent learns a policy that reduces drag and achieves net energy savings by suppressing spatiotemporally coherent flow structures, outperforming model-based baselines by a factor of two to four. The approach uses a nondimensional state-reward space to generalize across Reynolds numbers from 86,400 to 518,400 without retraining.

Significance. If the central claims hold under additional verification, this would represent a notable experimental demonstration of model-free RL for high-Re turbulent flow control without turbulence models or prior physics knowledge. The cross-Re generalization via nondimensional scaling and the explicit contrast with quasi-steady baselines are strengths that could inform future data-driven aerodynamics work.

major comments (3)
  1. Abstract and results on performance: the claims of 'two to four times greater performance' and 'direct suppression of coherent instabilities' are not supported by reported error bars, number of independent runs, or statistical tests, and that support is load-bearing for assessing robustness over model-based baselines.
  2. Methods section on sensor configuration: no observability metric, sensor placement diagram, or wake-velocity reconstruction error from the sparse onboard pressure/force measurements is provided, leaving open whether the MDP is sufficiently rich to discover and stabilize suppression of spatiotemporally coherent structures rather than quasi-steady mean-flow adjustment.
  3. RL framework and reward section: the reward weights and scaling factors are listed as free parameters without full specification or sensitivity analysis, which directly affects reproducibility of the reported convergence to a structure-suppressing policy.
minor comments (2)
  1. Clarify the exact nondimensionalization procedure for the state-reward space and how Reynolds-number conditioning is implemented in the policy network.
  2. Figure captions for flow visualizations should explicitly label the coherent structures being suppressed and include quantitative measures of suppression (e.g., modal energy reduction).
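The observability metric requested in the second major comment can be illustrated on a toy linear model. For x[k+1] = A x[k] with measurement y[k] = C x[k], the finite-horizon observability Gramian is the sum of (C A^k)^T (C A^k) over the horizon, and a singular Gramian means some state direction never reaches the sensor. The two-state systems below are hypothetical stand-ins, and a turbulent wake is nonlinear, so this is only a sketch of the idea rather than the analysis the paper would need.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def observability_gramian(a, c, horizon):
    """Sum over k < horizon of (C A^k)^T (C A^k)."""
    n = len(a)
    gram = [[0.0] * n for _ in range(n)]
    a_pow = [[float(i == j) for j in range(n)] for i in range(n)]  # A^0 = identity
    for _ in range(horizon):
        ca = matmul(c, a_pow)
        term = matmul(transpose(ca), ca)
        gram = [[gram[i][j] + term[i][j] for j in range(n)] for i in range(n)]
        a_pow = matmul(a, a_pow)
    return gram

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

theta = 0.3  # placeholder rotation angle per step
A_oscillator = [[0.98 * math.cos(theta), -0.98 * math.sin(theta)],
                [0.98 * math.sin(theta), 0.98 * math.cos(theta)]]
A_decoupled = [[0.9, 0.0], [0.0, 0.8]]
C_single_sensor = [[1.0, 0.0]]   # one sensor that reads only the first state

w_osc = observability_gramian(A_oscillator, C_single_sensor, horizon=20)
w_dec = observability_gramian(A_decoupled, C_single_sensor, horizon=20)
print(det2(w_osc), det2(w_dec))
```

In the oscillator the dynamics rotate the second state into view of the single sensor, so the Gramian is nonsingular; in the decoupled system the second state is never seen and the determinant is zero.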

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. The comments highlight important aspects of statistical robustness, observability, and reproducibility that we address point by point below. We have prepared revisions to strengthen these elements while preserving the core contributions of the work.

Point-by-point responses
  1. Referee: Abstract and results on performance: the claims of 'two to four times greater performance' and 'direct suppression of coherent instabilities' are not supported by reported error bars, number of independent runs, or statistical tests, and that support is load-bearing for assessing robustness over model-based baselines.

    Authors: We agree that explicit statistical support is necessary to substantiate the performance claims. In the revised manuscript we will report results aggregated over five independent experimental runs per controller, include standard-error bars on all drag-reduction and energy-savings metrics, and add two-sample t-tests confirming that the observed 2–4× improvement relative to the model-based baselines is statistically significant (p < 0.01). We will also include spectral analysis of wake-velocity time series demonstrating statistically significant attenuation of the dominant coherent-structure frequencies under the REACT policy. These additions directly address the robustness concern while leaving the reported performance ratios unchanged. revision: yes

  2. Referee: Methods section on sensor configuration: no observability metric, sensor placement diagram, or wake-velocity reconstruction error from the sparse onboard pressure/force measurements is provided, leaving open whether the MDP is sufficiently rich to discover and stabilize suppression of spatiotemporally coherent structures rather than quasi-steady mean-flow adjustment.

    Authors: We acknowledge the absence of these details. The revised Methods section will include (i) a labeled diagram of the pressure-tap and force-sensor locations on the Ahmed-body model, (ii) an observability Gramian analysis of the chosen state vector, and (iii) quantitative reconstruction error metrics (RMS and spectral) obtained by comparing sparse-sensor estimates against simultaneous PIV measurements in a subset of runs. These additions will demonstrate that the state space captures the essential dynamics of the dominant wake instabilities, supporting the claim that the learned policy targets coherent-structure suppression rather than purely mean-flow adjustment. revision: yes

  3. Referee: RL framework and reward section: the reward weights and scaling factors are listed as free parameters without full specification or sensitivity analysis, which directly affects reproducibility of the reported convergence to a structure-suppressing policy.

    Authors: We will expand the reward-function description to provide the exact numerical values of all weights and scaling factors used in the reported experiments. In addition, we will include a sensitivity study showing that the emergence of the structure-suppressing policy remains consistent across a ±20 % variation in the primary reward coefficients. These revisions will enable full reproducibility without altering the policy or performance results presented in the original manuscript. revision: yes
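The statistics proposed in the first response (five runs per controller, standard-error bars, a two-sample t-test) can be sketched with the Python standard library. The run values below are synthetic placeholders, not data from the paper, and the unequal-variance (Welch) form of the test is a choice made here for illustration. The standard library has no t-distribution CDF, so the sketch stops at the statistic and degrees of freedom; a p-value would need something like SciPy's `scipy.stats.ttest_ind` with `equal_var=False`.

```python
import math
import statistics

def standard_error(samples):
    return statistics.stdev(samples) / math.sqrt(len(samples))

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

# SYNTHETIC placeholder metrics for two controllers, five runs each.
controller_a_runs = [8.1, 7.9, 8.4, 8.0, 8.2]
controller_b_runs = [3.0, 3.3, 2.8, 3.1, 2.9]

t_stat, dof = welch_t(controller_a_runs, controller_b_runs)
print(standard_error(controller_a_runs), standard_error(controller_b_runs))
print(t_stat, dof)
```

With only five runs per group the degrees of freedom are small, which is why reporting the run count alongside the error bars matters.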

Circularity Check

0 steps flagged

No significant circularity: experimental RL results rest on physical measurements

Full rationale

The paper reports an empirical demonstration of model-free RL control in a physical wind-tunnel experiment on an Ahmed body. Performance metrics (drag reduction, energy savings, wake structure suppression) are obtained by direct comparison against physical baselines and quasi-steady policies, not by deriving quantities from fitted parameters or self-referential equations. The nondimensional state-reward space and Reynolds-number conditioning are presented as practical design choices for generalization rather than as outputs of a closed mathematical derivation. No load-bearing self-citations, uniqueness theorems, or ansatzes that reduce to the target claim appear in the text. The central results therefore remain self-contained against external experimental benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the premise that reinforcement learning can extract effective control from limited sensor streams in a chaotic flow without explicit physics models; no new physical entities are postulated, but several standard RL hyperparameters and the choice of nondimensional state-reward scaling are implicit free parameters whose values are not reported in the abstract.
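Because the reward weights are free parameters, a basic robustness check is to perturb them and see whether the preferred policy changes, in the spirit of the ±20 % variation mentioned in the rebuttal. The sketch below assumes a reward that penalizes a weighted sum of drag and actuation cost and compares three hypothetical candidate policies with invented outcome values; neither the reward form nor the numbers come from the paper.

```python
import itertools

# Hypothetical outcomes per policy: (nondimensional drag, nondimensional actuation cost).
CANDIDATES = {
    "no_control": (1.00, 0.00),
    "quasi_steady": (0.95, 0.02),
    "state_dependent": (0.90, 0.03),
}

def reward(drag, actuation, w_drag, w_act):
    """Assumed reward: negative weighted sum of drag and actuation cost."""
    return -(w_drag * drag + w_act * actuation)

def best_policy(w_drag, w_act):
    return max(CANDIDATES, key=lambda name: reward(*CANDIDATES[name], w_drag, w_act))

def is_robust(w_drag, w_act, rel=0.2):
    """True if the best policy is unchanged for every +/- rel scaling of both weights."""
    nominal = best_policy(w_drag, w_act)
    scales = (1 - rel, 1.0, 1 + rel)
    return all(best_policy(w_drag * sd, w_act * sa) == nominal
               for sd, sa in itertools.product(scales, scales))

print(best_policy(1.0, 1.0), is_robust(1.0, 1.0))
print(best_policy(1.0, 3.0), is_robust(1.0, 3.0))
```

In this toy setting the ranking is stable when actuation is cheap relative to drag and flips once the actuation weight is large, which is the kind of boundary a sensitivity study would need to locate.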

free parameters (1)
  • reward weights and scaling factors
    The nondimensional state-reward space and Reynolds-number conditioning require choices of amplitude scaling that are fitted or selected to achieve invariance; these are not numerically specified in the abstract.
axioms (1)
  • domain assumption: Reinforcement learning algorithms converge to a useful policy when trained on sparse, noisy experimental measurements from a real turbulent flow.
    Invoked when stating that the agent 'autonomously converges' without prior flow physics knowledge.




Acknowledgements. We acknowledge support from the UKRI AI for Net Zero grant EP/Y005619/1. J.Z. is supported by the President’s Scholarship at Imperial College London.

Author contribution. J.Z. developed the learning algo...