pith. machine review for the scientific record. sign in

arxiv: 2604.22794 · v1 · submitted 2026-04-13 · 📡 eess.SY · cs.LG· cs.SY

Recognition: unknown

Accelerating Reinforcement Learning for Wind Farm Control via Expert Demonstrations

Julian Quick, Marcus Binder Nilsen, Nikolay Dimitrov, Pierre-Elouan R\'ethor\'e, Tuhfe G\"o\c{c}men

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:51 UTC · model grok-4.3

classification 📡 eess.SY cs.LGcs.SY
keywords reinforcement learningwind farm controlyaw controlpretrainingexpert demonstrationsbehavior cloningwake steering
0
0 comments X

The pith

Pretraining with expert demonstrations lets reinforcement learning wind farm controllers start at baseline performance instead of lagging by 12 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether expert demonstrations from a steady-state wake optimizer can initialize a reinforcement learning agent so it avoids the slow and costly early phase of training when applied to wind farm yaw control. An untrained agent begins well below a simple zero-yaw baseline, but the pretraining step raises starting performance to near that baseline level. In experiments with a four-turbine layout, all agents then converge after 250,000 steps to performance that exceeds a lookup-table controller, which needs twice as many steps to reach roughly 7 percent power gain. This matters because an untrained controller deployed on a real wind farm would otherwise produce reduced power output for a long initial period.

Core claim

Expert demonstrations generated by deploying a steady-state optimizer inside a dynamic wake simulation can initialize both the actor and critic of a reinforcement learning agent through behavior cloning, removing the initial performance penalty and allowing convergence to higher power gains than a lookup-table controller.

What carries the argument

Behavior cloning on expert trajectories from a steady-state optimizer inside the dynamic simulator to initialize the policy and value networks of the reinforcement learning agent.

If this is right

  • Pretrained agents begin near baseline performance rather than 12 percent below it.
  • All training configurations reach similar final performance within 250,000 steps.
  • Final power gains surpass those of a lookup-table controller that requires 500,000 steps for about 7 percent improvement.
  • The method avoids extended periods of reduced power output during the learning phase.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pretraining step could shorten deployment time for reinforcement learning controllers on much larger wind farms.
  • The approach might apply to other flow-control problems where steady-state models are available but full dynamics must still be learned online.
  • Further gains could come from mixing these demonstrations with other acceleration techniques such as curriculum learning.

Load-bearing premise

Demonstrations created by the steady-state optimizer transfer effectively when used to initialize the networks for continued learning in the dynamic environment.

What would settle it

The pretrained agent showing a large initial performance gap below the zero-yaw baseline or failing to exceed lookup-table performance after fine-tuning would disprove the benefit of this pretraining.

Figures

Figures reproduced from arXiv: 2604.22794 by Julian Quick, Marcus Binder Nilsen, Nikolay Dimitrov, Pierre-Elouan R\'ethor\'e, Tuhfe G\"o\c{c}men.

Figure 1
Figure 1. Figure 1: Single-case evaluation at 12 m/s and 275◦ . Top: wind farm power gain relative to Greedy for SAC and PyWake. Bottom: yaw control actions per turbine; the PyWake yaw for the two upstream and two downstream turbines coincide. Right: snapshot of the flow field at t = 3000 s with the applied yaw angles [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: compares the fully trained SAC (Large pretraining) to PyWake at U∞ = 10 m/s and θ = 270◦ , a fully aligned inflow case. Curves show the mean and 1 standard deviation across turbulence realizations, where SAC aggregates over 5 seeds × 6 boxes (60 cases) and PyWake over 6 boxes. Background shading indicates which controller attains higher mean power at each time (green for SAC, red for PyWake) and it shows t… view at source ↗
Figure 3
Figure 3. Figure 3: Box plots of wind farm power gain (%) relative to Greedy across wind speeds (left) and wind directions (right) after 0, 2.5 × 105 , and 5 × 105 training steps. Pretraining most strongly improves the untrained policy and low-wind regimes; little gain is observed at edge directions where wake interactions are weak [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Heatmap of the mean percentage increase in power for all pre-training sizes and training steps. Note that the PyWake agent has a mean increase of 5.96%. This value is indicated by the black line in the colorbar on the right Several observations from [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
read the original abstract

Reinforcement learning (RL) offers a promising approach for adaptive wind farm flow control, yet its practical deployment is hindered by slow training convergence and poor initial performance, factors that could translate to years of reduced power output if an untrained agent were deployed directly. This work investigates whether domain knowledge from steady-state wake models can accelerate RL training and improve initial controller performance. We propose a pretraining methodology in which expert demonstrations are generated by deploying a PyWake-based steady-state optimizer within a dynamic wake simulation (WindGym), then used to initialize both the actor and critic networks of a Soft Actor-Critic agent via behavior cloning. Experiments on a 2x2 wind farm show that pretraining eliminates the costly initial learning phase: while an untrained agent underperforms the greedy zero-yaw baseline by approximately 12%, pretraining raises initial performance to near-baseline levels. During online fine-tuning, all configurations converge within 250,000 environment steps to achieve similar performance, ultimately exceeding that of a lookup-table controller, which reaches approximately 7% power gain after 500,000 steps.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes using expert demonstrations generated by a PyWake steady-state optimizer deployed inside the dynamic WindGym simulator to pretrain both actor and critic networks of a Soft Actor-Critic agent via behavior cloning. On a 2x2 wind farm, this pretraining is claimed to eliminate the initial performance dip seen in untrained agents (which underperform the greedy zero-yaw baseline by ~12%), raising initial performance to near-baseline levels; all configurations then converge within 250,000 steps to outperform a lookup-table controller that achieves ~7% power gain after 500,000 steps.

Significance. If the transfer from steady-state expert trajectories to the dynamic simulator holds, the approach offers a practical way to reduce the multi-year power losses that would otherwise occur during RL training in operational wind farms. It demonstrates a hybrid model-based initialization strategy that could generalize to other adaptive flow-control problems in energy systems. The absence of robustness metrics and ablations, however, limits the strength of this assessment.

major comments (2)
  1. [Abstract] Abstract: the central claim that pretraining eliminates the initial learning phase depends on effective transfer of steady-state PyWake demonstrations to initialize SAC networks in the dynamic WindGym environment. The reported result—that pretraining only reaches 'near-baseline' performance rather than the higher gain expected from the optimizer—suggests possible partial or negative transfer arising from the mismatch between equilibrium wake assumptions and time-varying inflow/wake advection; an ablation isolating demonstration quality (e.g., random vs. expert trajectories) is required to confirm the source of the observed improvement.
  2. [Abstract] Abstract and Experiments: numerical gains (12% underperformance, 7% power gain, convergence at 250k steps) are reported without error bars, statistical significance tests, hyperparameter schedules, or ablation studies. This omission makes it impossible to determine whether the elimination of the initial dip and final outperformance are robust or sensitive to random seeds and tuning choices.
minor comments (1)
  1. [Abstract] Abstract: define 'near-baseline levels' quantitatively (e.g., percentage of greedy zero-yaw power) and specify the exact number of independent runs used to obtain the reported figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which identify key areas where additional analysis can strengthen the manuscript. We address each major comment below and have revised the paper to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that pretraining eliminates the initial learning phase depends on effective transfer of steady-state PyWake demonstrations to initialize SAC networks in the dynamic WindGym environment. The reported result—that pretraining only reaches 'near-baseline' performance rather than the higher gain expected from the optimizer—suggests possible partial or negative transfer arising from the mismatch between equilibrium wake assumptions and time-varying inflow/wake advection; an ablation isolating demonstration quality (e.g., random vs. expert trajectories) is required to confirm the source of the observed improvement.

    Authors: We agree that the gap between the steady-state optimizer gain and the near-baseline performance after pretraining indicates partial transfer attributable to the mismatch between equilibrium assumptions and dynamic wake advection. To isolate the contribution of demonstration quality, we have added an ablation study in the revised manuscript that compares behavior cloning on expert PyWake trajectories against random trajectories of equivalent length. The results show that only the expert demonstrations eliminate the initial dip and accelerate convergence, confirming that the observed benefit arises from the quality of the demonstrations rather than pretraining in general. revision: yes

  2. Referee: [Abstract] Abstract and Experiments: numerical gains (12% underperformance, 7% power gain, convergence at 250k steps) are reported without error bars, statistical significance tests, hyperparameter schedules, or ablation studies. This omission makes it impossible to determine whether the elimination of the initial dip and final outperformance are robust or sensitive to random seeds and tuning choices.

    Authors: We acknowledge that the absence of error bars, statistical tests, and hyperparameter details limits assessment of robustness. In the revised manuscript we now report all key metrics with error bars computed across five independent random seeds, include statistical significance tests (paired t-tests) for the main performance comparisons, provide the full hyperparameter schedules and training details in an appendix, and expand the set of ablation studies (including the demonstration-quality ablation noted above). These additions confirm that the elimination of the initial dip and the final outperformance are consistent across seeds. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results benchmarked against independent external controllers

full rationale

The paper describes an experimental pipeline that generates expert trajectories via a PyWake steady-state optimizer inside the WindGym dynamic simulator, uses them for behavior-cloning initialization of SAC actor and critic networks, and then reports online fine-tuning performance. All quantitative claims (initial performance lift, convergence within 250k steps, final power gain) are obtained by direct comparison to separate, non-derived baselines—the greedy zero-yaw policy and a lookup-table controller—rather than by any equation or fitted parameter that reduces the reported gains to quantities defined from the same data. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the core result; the derivation chain is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the transferability of steady-state wake optima to dynamic simulation trajectories and on standard reinforcement-learning assumptions about policy and value function approximation.

axioms (1)
  • domain assumption Steady-state wake models produce expert trajectories that are sufficiently informative for initializing policies in dynamic wake simulations
    Invoked when PyWake optimizer outputs are used directly as demonstrations inside WindGym.

pith-pipeline@v0.9.0 · 5516 in / 1240 out tokens · 50403 ms · 2026-05-10T15:51:11.547725+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 4 canonical work pages · 1 internal anchor

  1. [1]

    Veers P, Dykes K, Lantz E, Barth S, Bottasso C L, Carlson O, Clifton A, Green J, Green P, Holttinen H et al.2019 Grand challenges in the science of wind energyScience366eaau2027

  2. [2]

    Meyers J, Bottasso C, Dykes K, Fleming P, Gebraad P, Giebel G, Göçmen T and Van Wingerden J W 2022 Wind farm flow control: prospects and challengesWind Energy Science Discussions20221–56

  3. [3]

    Howland M F and Dabiri J O 2020 Influence of wake model superposition and secondary steering on model- based wake steering control with SCADA data assimilationEnergies

  4. [4]

    Abkar M, Zehtabiyan-Rezaie N and Iosifidis A 2023 Reinforcement learning for wind-farm flow control: Current state and future actionsTheoretical and Applied Mechanics Letters100475

  5. [5]

    Göçmen T, Liew J, Kadoche E, Dimitrov N, Riva R, Andersen S J, Lio A W, Quick J, Réthoré P E and Dykes K 2024 Data-driven wind farm flow control and challenges towards field implementationRenewable and Sustainable Energy ReviewsUnder Review

  6. [6]

    Duan Y, Chen X, Houthooft R, Schulman J and Abbeel P 2016 Benchmarking deep reinforcement learning for continuous control (Preprint1604.06778) URLhttps://arxiv.org/abs/1604.06778

  7. [7]

    Zhao H, Zhao J, Qiu J, Liang G and Dong Z Y 2020 Cooperative wind farm control with deep reinforcement learning and knowledge-assisted learningIEEE Transactions on Industrial Informatics166912–6921

  8. [8]

    Stanfel P, Johnson K, Bay C J and King J 2020 A distributed reinforcement learning yaw control approach for wind farm energy capture maximization2020 American Control Conference (ACC)pp 4065–4070

  9. [9]

    version 4.2.1GitHub repositoryURLhttps://github.com/NREL/floris

    NREL 2024 FLORIS. version 4.2.1GitHub repositoryURLhttps://github.com/NREL/floris

  10. [10]

    Bizon Monroc C, Bušić A, Dubuc D and Zhu J 2024 Towards fine tuning wake steering policies in the field: an imitation-based approachJournal of Physics: Conference Series2767032017 URLhttps://doi.org/ 10.1088/1742-6596/2767/3/032017

  11. [11]

    DTU 2025 WindgymAvailable:https://github.com/DTUWindEnergy/WindGym[Accessed: 27-05-2025]

  12. [12]

    Haarnoja T, Zhou A, Abbeel P and Levine S 2018 Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor (Preprint1801.01290) URLhttps://arxiv.org/abs/1801. 01290

  13. [13]

    Pedersen M M, Steiner J, Nilsen M B, Lohmann J, Hodgson E L, Riva R, Troldborg N, Andersen S J, Larsen G, Verelst D R and Réthoré P E 2026 Dynamiks 0.0.4: An open-source dynamic wind system simulator URL https://gitlab.windenergy.dtu.dk/DYNAMIKS/dynamiks

  14. [14]

    org/preprints/wes-2025-200/

    Steiner J, Hodgson E L, van der Laan M Pet al.2025 A multi-fidelity model benchmark for wake steering of a large turbine in a neutral ABLWind Energy Science Discussions20251–32 URLhttps://wes.copernicus. org/preprints/wes-2025-200/

  15. [15]

    Larsen G C, Aagaard Madsen H and Bingöl F 2007 Dynamic wake meandering modeling

  16. [16]

    Bak C, Zahle F, Bitsche R, Kim T, Yde A, Henriksen L, Hansen M, Blasques J, Gaunaa M and Natarajan A 2013 The DTU 10-MW reference wind turbine danish Wind Power Research 2013; Conference date: 27-05-2013 Through 28-05-2013

  17. [17]

    Pedersen M M, Forsting A M, van der Laan P, Riva R, Romàn L A A, Risco J C, Friis-Møller M, Quick J, Christiansen J P S, Rodrigues R V, Olsen B T and Réthoré P E 2023 Pywake 2.5.0: An open-source wind farm simulation tool URLhttps://gitlab.windenergy.dtu.dk/TOPFARM/PyWake

  18. [18]

    Neustroev G, Andringa S P, Verzijlbergh R A and De Weerdt M M 2022 Deep reinforcement learning for active wake controlProceedings of the 21st International Conference on Autonomous Agents and Multiagent Systemspp 944–953

  19. [19]

    Fleming P A, Stanley A P J, Bay C J, King J, Simley E, Doekemeijer B M and Mudafort R 2022 Serial-refine method for fast wake-steering yaw optimizationJournal of Physics: Conference Series2265032109 URL https://dx.doi.org/10.1088/1742-6596/2265/3/032109