arxiv: 2604.22794 · v1 · submitted 2026-04-13 · 📡 eess.SY · cs.LG· cs.SY

Recognition: unknown

Accelerating Reinforcement Learning for Wind Farm Control via Expert Demonstrations

Julian Quick, Marcus Binder Nilsen, Nikolay Dimitrov, Pierre-Elouan R\'ethor\'e, Tuhfe G\"o\c{c}men

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:51 UTC · model grok-4.3

classification 📡 eess.SY cs.LGcs.SY

keywords reinforcement learningwind farm controlyaw controlpretrainingexpert demonstrationsbehavior cloningwake steering

0 comments

The pith

Pretraining with expert demonstrations lets reinforcement learning wind farm controllers start at baseline performance instead of lagging by 12 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether expert demonstrations from a steady-state wake optimizer can initialize a reinforcement learning agent so it avoids the slow and costly early phase of training when applied to wind farm yaw control. An untrained agent begins well below a simple zero-yaw baseline, but the pretraining step raises starting performance to near that baseline level. In experiments with a four-turbine layout, all agents then converge after 250,000 steps to performance that exceeds a lookup-table controller, which needs twice as many steps to reach roughly 7 percent power gain. This matters because an untrained controller deployed on a real wind farm would otherwise produce reduced power output for a long initial period.

Core claim

Expert demonstrations generated by deploying a steady-state optimizer inside a dynamic wake simulation can initialize both the actor and critic of a reinforcement learning agent through behavior cloning, removing the initial performance penalty and allowing convergence to higher power gains than a lookup-table controller.

What carries the argument

Behavior cloning on expert trajectories from a steady-state optimizer inside the dynamic simulator to initialize the policy and value networks of the reinforcement learning agent.

If this is right

Pretrained agents begin near baseline performance rather than 12 percent below it.
All training configurations reach similar final performance within 250,000 steps.
Final power gains surpass those of a lookup-table controller that requires 500,000 steps for about 7 percent improvement.
The method avoids extended periods of reduced power output during the learning phase.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pretraining step could shorten deployment time for reinforcement learning controllers on much larger wind farms.
The approach might apply to other flow-control problems where steady-state models are available but full dynamics must still be learned online.
Further gains could come from mixing these demonstrations with other acceleration techniques such as curriculum learning.

Load-bearing premise

Demonstrations created by the steady-state optimizer transfer effectively when used to initialize the networks for continued learning in the dynamic environment.

What would settle it

The pretrained agent showing a large initial performance gap below the zero-yaw baseline or failing to exceed lookup-table performance after fine-tuning would disprove the benefit of this pretraining.

Figures

Figures reproduced from arXiv: 2604.22794 by Julian Quick, Marcus Binder Nilsen, Nikolay Dimitrov, Pierre-Elouan R\'ethor\'e, Tuhfe G\"o\c{c}men.

**Figure 1.** Figure 1: Single-case evaluation at 12 m/s and 275◦ . Top: wind farm power gain relative to Greedy for SAC and PyWake. Bottom: yaw control actions per turbine; the PyWake yaw for the two upstream and two downstream turbines coincide. Right: snapshot of the flow field at t = 3000 s with the applied yaw angles [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗

**Figure 2.** Figure 2: compares the fully trained SAC (Large pretraining) to PyWake at U∞ = 10 m/s and θ = 270◦ , a fully aligned inflow case. Curves show the mean and 1 standard deviation across turbulence realizations, where SAC aggregates over 5 seeds × 6 boxes (60 cases) and PyWake over 6 boxes. Background shading indicates which controller attains higher mean power at each time (green for SAC, red for PyWake) and it shows t… view at source ↗

**Figure 3.** Figure 3: Box plots of wind farm power gain (%) relative to Greedy across wind speeds (left) and wind directions (right) after 0, 2.5 × 105 , and 5 × 105 training steps. Pretraining most strongly improves the untrained policy and low-wind regimes; little gain is observed at edge directions where wake interactions are weak [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Heatmap of the mean percentage increase in power for all pre-training sizes and training steps. Note that the PyWake agent has a mean increase of 5.96%. This value is indicated by the black line in the colorbar on the right Several observations from [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

read the original abstract

Reinforcement learning (RL) offers a promising approach for adaptive wind farm flow control, yet its practical deployment is hindered by slow training convergence and poor initial performance, factors that could translate to years of reduced power output if an untrained agent were deployed directly. This work investigates whether domain knowledge from steady-state wake models can accelerate RL training and improve initial controller performance. We propose a pretraining methodology in which expert demonstrations are generated by deploying a PyWake-based steady-state optimizer within a dynamic wake simulation (WindGym), then used to initialize both the actor and critic networks of a Soft Actor-Critic agent via behavior cloning. Experiments on a 2x2 wind farm show that pretraining eliminates the costly initial learning phase: while an untrained agent underperforms the greedy zero-yaw baseline by approximately 12%, pretraining raises initial performance to near-baseline levels. During online fine-tuning, all configurations converge within 250,000 environment steps to achieve similar performance, ultimately exceeding that of a lookup-table controller, which reaches approximately 7% power gain after 500,000 steps.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Pretraining an SAC agent on steady-state PyWake demos skips the initial performance dip in WindGym but ends up with similar final gains to other methods and shows signs of domain mismatch.

read the letter

The key point is that behavior cloning from a steady-state optimizer lets the RL agent avoid starting well below the greedy baseline, but the final power gains stay modest and the transfer from static to dynamic wakes looks incomplete. The paper runs a 2x2 farm case where an untrained SAC starts about 12% below zero-yaw control, the pretrained version starts near baseline, and after 250k steps all versions beat a lookup table that reaches 7% gain at 500k steps. That initial-phase fix is the practical takeaway for anyone who has watched RL agents waste energy while learning in simulation. The combination of PyWake-generated expert trajectories fed into both actor and critic of SAC inside WindGym is not previously reported in the cited wind-farm RL work, so the pipeline itself counts as new for this application. The experiments are run in a standard dynamic simulator and compare against clear external baselines, which keeps the claims grounded. The main weaknesses are the lack of error bars, statistical tests, or ablations that would show whether the demos actually drive the improvement or if network initialization alone would do similar work. The fact that pretraining only reaches near-baseline rather than the higher gain the PyWake expert should deliver points to a real gap between steady-state assumptions and the time-varying wakes in WindGym. Without a direct check of how well the cloned policy matches the expert inside the dynamic environment, it is hard to know how much useful knowledge actually transferred. This paper is aimed at researchers who want to move RL wind-farm controllers from simulation toward deployment and need to solve the cold-start problem. Readers working on applied RL in energy systems will find the setup and the reported numbers useful even if they want more controls on robustness. It deserves a serious referee because the problem is concrete, the method is reproducible in principle, and the results are presented plainly enough to evaluate. I would send it to peer review with requests for ablations and transfer diagnostics.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes using expert demonstrations generated by a PyWake steady-state optimizer deployed inside the dynamic WindGym simulator to pretrain both actor and critic networks of a Soft Actor-Critic agent via behavior cloning. On a 2x2 wind farm, this pretraining is claimed to eliminate the initial performance dip seen in untrained agents (which underperform the greedy zero-yaw baseline by ~12%), raising initial performance to near-baseline levels; all configurations then converge within 250,000 steps to outperform a lookup-table controller that achieves ~7% power gain after 500,000 steps.

Significance. If the transfer from steady-state expert trajectories to the dynamic simulator holds, the approach offers a practical way to reduce the multi-year power losses that would otherwise occur during RL training in operational wind farms. It demonstrates a hybrid model-based initialization strategy that could generalize to other adaptive flow-control problems in energy systems. The absence of robustness metrics and ablations, however, limits the strength of this assessment.

major comments (2)

[Abstract] Abstract: the central claim that pretraining eliminates the initial learning phase depends on effective transfer of steady-state PyWake demonstrations to initialize SAC networks in the dynamic WindGym environment. The reported result—that pretraining only reaches 'near-baseline' performance rather than the higher gain expected from the optimizer—suggests possible partial or negative transfer arising from the mismatch between equilibrium wake assumptions and time-varying inflow/wake advection; an ablation isolating demonstration quality (e.g., random vs. expert trajectories) is required to confirm the source of the observed improvement.
[Abstract] Abstract and Experiments: numerical gains (12% underperformance, 7% power gain, convergence at 250k steps) are reported without error bars, statistical significance tests, hyperparameter schedules, or ablation studies. This omission makes it impossible to determine whether the elimination of the initial dip and final outperformance are robust or sensitive to random seeds and tuning choices.

minor comments (1)

[Abstract] Abstract: define 'near-baseline levels' quantitatively (e.g., percentage of greedy zero-yaw power) and specify the exact number of independent runs used to obtain the reported figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which identify key areas where additional analysis can strengthen the manuscript. We address each major comment below and have revised the paper to incorporate the suggested improvements.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that pretraining eliminates the initial learning phase depends on effective transfer of steady-state PyWake demonstrations to initialize SAC networks in the dynamic WindGym environment. The reported result—that pretraining only reaches 'near-baseline' performance rather than the higher gain expected from the optimizer—suggests possible partial or negative transfer arising from the mismatch between equilibrium wake assumptions and time-varying inflow/wake advection; an ablation isolating demonstration quality (e.g., random vs. expert trajectories) is required to confirm the source of the observed improvement.

Authors: We agree that the gap between the steady-state optimizer gain and the near-baseline performance after pretraining indicates partial transfer attributable to the mismatch between equilibrium assumptions and dynamic wake advection. To isolate the contribution of demonstration quality, we have added an ablation study in the revised manuscript that compares behavior cloning on expert PyWake trajectories against random trajectories of equivalent length. The results show that only the expert demonstrations eliminate the initial dip and accelerate convergence, confirming that the observed benefit arises from the quality of the demonstrations rather than pretraining in general. revision: yes
Referee: [Abstract] Abstract and Experiments: numerical gains (12% underperformance, 7% power gain, convergence at 250k steps) are reported without error bars, statistical significance tests, hyperparameter schedules, or ablation studies. This omission makes it impossible to determine whether the elimination of the initial dip and final outperformance are robust or sensitive to random seeds and tuning choices.

Authors: We acknowledge that the absence of error bars, statistical tests, and hyperparameter details limits assessment of robustness. In the revised manuscript we now report all key metrics with error bars computed across five independent random seeds, include statistical significance tests (paired t-tests) for the main performance comparisons, provide the full hyperparameter schedules and training details in an appendix, and expand the set of ablation studies (including the demonstration-quality ablation noted above). These additions confirm that the elimination of the initial dip and the final outperformance are consistent across seeds. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results benchmarked against independent external controllers

full rationale

The paper describes an experimental pipeline that generates expert trajectories via a PyWake steady-state optimizer inside the WindGym dynamic simulator, uses them for behavior-cloning initialization of SAC actor and critic networks, and then reports online fine-tuning performance. All quantitative claims (initial performance lift, convergence within 250k steps, final power gain) are obtained by direct comparison to separate, non-derived baselines—the greedy zero-yaw policy and a lookup-table controller—rather than by any equation or fitted parameter that reduces the reported gains to quantities defined from the same data. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the core result; the derivation chain is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the transferability of steady-state wake optima to dynamic simulation trajectories and on standard reinforcement-learning assumptions about policy and value function approximation.

axioms (1)

domain assumption Steady-state wake models produce expert trajectories that are sufficiently informative for initializing policies in dynamic wake simulations
Invoked when PyWake optimizer outputs are used directly as demonstrations inside WindGym.

pith-pipeline@v0.9.0 · 5516 in / 1240 out tokens · 50403 ms · 2026-05-10T15:51:11.547725+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 4 canonical work pages · 1 internal anchor

[1]

Veers P, Dykes K, Lantz E, Barth S, Bottasso C L, Carlson O, Clifton A, Green J, Green P, Holttinen H et al.2019 Grand challenges in the science of wind energyScience366eaau2027

2019
[2]

Meyers J, Bottasso C, Dykes K, Fleming P, Gebraad P, Giebel G, Göçmen T and Van Wingerden J W 2022 Wind farm flow control: prospects and challengesWind Energy Science Discussions20221–56

2022
[3]

Howland M F and Dabiri J O 2020 Influence of wake model superposition and secondary steering on model- based wake steering control with SCADA data assimilationEnergies

2020
[4]

Abkar M, Zehtabiyan-Rezaie N and Iosifidis A 2023 Reinforcement learning for wind-farm flow control: Current state and future actionsTheoretical and Applied Mechanics Letters100475

2023
[5]

Göçmen T, Liew J, Kadoche E, Dimitrov N, Riva R, Andersen S J, Lio A W, Quick J, Réthoré P E and Dykes K 2024 Data-driven wind farm flow control and challenges towards field implementationRenewable and Sustainable Energy ReviewsUnder Review

2024
[6]

Duan Y, Chen X, Houthooft R, Schulman J and Abbeel P 2016 Benchmarking deep reinforcement learning for continuous control (Preprint1604.06778) URLhttps://arxiv.org/abs/1604.06778

work page arXiv 2016
[7]

Zhao H, Zhao J, Qiu J, Liang G and Dong Z Y 2020 Cooperative wind farm control with deep reinforcement learning and knowledge-assisted learningIEEE Transactions on Industrial Informatics166912–6921

2020
[8]

Stanfel P, Johnson K, Bay C J and King J 2020 A distributed reinforcement learning yaw control approach for wind farm energy capture maximization2020 American Control Conference (ACC)pp 4065–4070

2020
[9]

version 4.2.1GitHub repositoryURLhttps://github.com/NREL/floris

NREL 2024 FLORIS. version 4.2.1GitHub repositoryURLhttps://github.com/NREL/floris

2024
[10]

Bizon Monroc C, Bušić A, Dubuc D and Zhu J 2024 Towards fine tuning wake steering policies in the field: an imitation-based approachJournal of Physics: Conference Series2767032017 URLhttps://doi.org/ 10.1088/1742-6596/2767/3/032017

work page doi:10.1088/1742-6596/2767/3/032017 2024
[11]

DTU 2025 WindgymAvailable:https://github.com/DTUWindEnergy/WindGym[Accessed: 27-05-2025]

2025
[12]

Haarnoja T, Zhou A, Abbeel P and Levine S 2018 Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor (Preprint1801.01290) URLhttps://arxiv.org/abs/1801. 01290

work page internal anchor Pith review arXiv 2018
[13]

Pedersen M M, Steiner J, Nilsen M B, Lohmann J, Hodgson E L, Riva R, Troldborg N, Andersen S J, Larsen G, Verelst D R and Réthoré P E 2026 Dynamiks 0.0.4: An open-source dynamic wind system simulator URL https://gitlab.windenergy.dtu.dk/DYNAMIKS/dynamiks

2026
[14]

org/preprints/wes-2025-200/

Steiner J, Hodgson E L, van der Laan M Pet al.2025 A multi-fidelity model benchmark for wake steering of a large turbine in a neutral ABLWind Energy Science Discussions20251–32 URLhttps://wes.copernicus. org/preprints/wes-2025-200/

2025
[15]

Larsen G C, Aagaard Madsen H and Bingöl F 2007 Dynamic wake meandering modeling

2007
[16]

Bak C, Zahle F, Bitsche R, Kim T, Yde A, Henriksen L, Hansen M, Blasques J, Gaunaa M and Natarajan A 2013 The DTU 10-MW reference wind turbine danish Wind Power Research 2013; Conference date: 27-05-2013 Through 28-05-2013

2013
[17]

Pedersen M M, Forsting A M, van der Laan P, Riva R, Romàn L A A, Risco J C, Friis-Møller M, Quick J, Christiansen J P S, Rodrigues R V, Olsen B T and Réthoré P E 2023 Pywake 2.5.0: An open-source wind farm simulation tool URLhttps://gitlab.windenergy.dtu.dk/TOPFARM/PyWake

2023
[18]

Neustroev G, Andringa S P, Verzijlbergh R A and De Weerdt M M 2022 Deep reinforcement learning for active wake controlProceedings of the 21st International Conference on Autonomous Agents and Multiagent Systemspp 944–953

2022
[19]

Fleming P A, Stanley A P J, Bay C J, King J, Simley E, Doekemeijer B M and Mudafort R 2022 Serial-refine method for fast wake-steering yaw optimizationJournal of Physics: Conference Series2265032109 URL https://dx.doi.org/10.1088/1742-6596/2265/3/032109

work page doi:10.1088/1742-6596/2265/3/032109 2022