arxiv: 2605.11102 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI· cs.SY· eess.SY

Recognition: 2 theorem links

· Lean Theorem

Newton's Lantern: A Reinforcement Learning Framework for Finetuning AC Power Flow Warm Start Models

Dhruv Suri, Helgi Hilmarsson, Shourya Bose

Pith reviewed 2026-05-13 07:05 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.SYeess.SY

keywords AC power flowNewton-Raphson methodreinforcement learningwarm startvoltage collapsepower systems optimizationpolicy optimization

0 comments

The pith

Newton's Lantern finetunes AC power flow warm starts with reinforcement learning to guarantee convergence on all test cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the number of Newton-Raphson iterations for solving AC power flow depends on the direction of the initial error rather than its size. This leads to a lower bound that becomes ineffective near voltage collapse, explaining why supervised warm-start methods fail there. To address this, the authors develop Newton's Lantern, which uses reinforcement learning to optimize the warm-start predictions by treating iteration count as the reward signal. The approach combines a policy optimized via group relative policy optimization with a reward model learned from perturbations. On standard power system benchmarks, it is the only method shown to converge for every instance while using the fewest iterations on average.

Core claim

By proving that iteration count is bounded below by a term involving the alignment of the warm-start error with the Jacobian's singular vectors, the work shows supervised regression is insufficient near bifurcations. Newton's Lantern instead learns a policy that adjusts the base model's output to minimize actual iteration counts through a learned reward proxy, ensuring reliable convergence across large networks.

What carries the argument

Group relative policy optimization of a policy that perturbs base warm-start predictions, guided by a reward model trained to predict iteration counts from error perturbations.

Load-bearing premise

The reward model accurately estimates iteration counts for the policy's proposed warm starts based on training perturbations.

What would settle it

A new test snapshot near voltage collapse where the RL-generated warm start requires more iterations than a simple supervised prediction or fails to converge.

Figures

Figures reproduced from arXiv: 2605.11102 by Dhruv Suri, Helgi Hilmarsson, Shourya Bose.

**Figure 2.** Figure 2: IEEE 14-bus diagnostics for Theorem 3.1 and Corollary 3.2. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

read the original abstract

Neural warm starts can sharply reduce the number of Newton-Raphson iterations required to solve the AC power flow problem, but existing supervised approaches generalize poorly on heavily loaded instances near voltage collapse. We prove a lower bound on the Newton-Raphson iteration count that depends on the direction of the warm start error rather than on its magnitude, and show as a corollary that the bound becomes vacuous as the smallest singular value of the power-flow Jacobian shrinks, identifying the failure mode of supervised regression near the saddle-node bifurcation. Motivated by this analysis, we introduce Newton's Lantern, a finetuning pipeline that combines group relative policy optimization with a learned reward model trained on perturbations of the base model's predictions, using the iteration count itself as the supervisory signal. Across IEEE 118-bus, GOC 500-bus, and GOC 2000-bus benchmarks, Newton's Lantern is the only method that converges on every test snapshot while attaining the smallest mean iteration count.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper pairs a direction-sensitive lower bound on Newton-Raphson steps with an RL finetuning loop that directly targets iteration count, and the reported results show it as the only method that converges on every snapshot across the three grids.

read the letter

The main point is that this work explains why plain supervised warm starts fail near voltage collapse and then uses RL to steer predictions into directions that cut iterations. They derive a lower bound on Newton-Raphson steps that depends on the angle of the warm-start error rather than its size, and show the bound loses force when the Jacobian's smallest singular value shrinks. That analysis lines up with the known trouble spots in heavily loaded cases and gives a clear reason to move beyond regression alone. The finetuning step trains a reward model on perturbations of the base predictor to estimate iteration count, then applies group relative policy optimization to adjust the model. Because the reward comes from actual solver runs, it avoids some of the usual self-referential traps. On the IEEE 118-bus, GOC 500-bus, and GOC 2000-bus sets, the method is the only one that converges everywhere while posting the lowest mean iteration count. That combination of bound and empirical outcome is the real addition relative to earlier supervised warm-start papers. The soft spot is the reward model's coverage. It is fit only on perturbations around the base model, yet the policy can generate warm starts farther away; without explicit checks that the reward still tracks true iteration counts on policy outputs, the gains could partly reflect regularization rather than the intended direction effect. The abstract claims a proof but leaves the derivation and any tightness results for the full text. This is aimed at people who run repeated AC power flow solves in planning or contingency work. Readers who need measurable speedups on standard test systems will find concrete value. The idea is coherent enough and the claims are testable enough that it should go to peer review, with the referees asked to verify the bound derivation and the reward alignment on the final policy.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Newton's Lantern, a reinforcement learning framework for finetuning neural models that provide warm starts for solving the AC power flow problem using Newton-Raphson iteration. It derives a lower bound on the number of iterations that depends on the direction of the warm-start error vector rather than its magnitude, with a corollary showing the bound becomes uninformative near the saddle-node bifurcation where the Jacobian's smallest singular value approaches zero. The method uses group relative policy optimization guided by a learned reward model trained on perturbations of a base supervised predictor, with the actual iteration count serving as the reward signal. On IEEE 118-bus, GOC 500-bus, and GOC 2000-bus test sets, the approach is reported to be the only one achieving convergence on all snapshots while recording the lowest average iteration counts.

Significance. If the theoretical bound is correctly derived and the empirical gains are attributable to the direction-shaping mechanism rather than incidental regularization, the work would offer a principled way to improve warm-start quality for power-flow solvers in challenging operating regimes. The explicit connection between error direction and iteration count, combined with the RL finetuning pipeline that directly optimizes the observable iteration count, represents a substantive advance over purely supervised regression approaches. The manuscript ships a proof of the iteration bound and reproducible benchmarks on standard power-system test cases, which strengthens the assessment.

major comments (3)

[Theoretical analysis (likely §3)] The lower bound on Newton-Raphson iterations is stated to depend on the direction of the warm-start error; however, the full derivation is not reproduced in the provided abstract, and the corollary linking the bound to the smallest singular value of the Jacobian requires explicit verification that the bound indeed becomes vacuous as σ_min → 0. This is load-bearing for motivating the RL approach over supervised learning.
[Method and Experiments (likely §4-5)] The learned reward model is trained exclusively on perturbations around the base supervised model's predictions, yet the RL policy (GRPO) can generate warm starts outside this distribution. No validation is reported of the reward model's prediction accuracy on the actual policy outputs (e.g., correlation or error metrics between predicted and true iteration counts on policy samples). This mismatch risks optimizing a misaligned surrogate, undermining the claim that gains arise from shaping error direction as predicted by the bound.
[Empirical results (likely Table 1 or §5)] The headline result that Newton's Lantern is the only method converging on every test snapshot with the smallest mean iteration count lacks reported error bars, standard deviations, or statistical significance tests across the multiple benchmarks. Without these, it is difficult to assess whether the observed superiority is robust or could be explained by training variance.

minor comments (2)

[Notation] Clarify the precise definition of the warm-start error vector and how the direction is quantified in the bound.
[Related work] Ensure comparison to other RL or optimization-based warm-start methods is comprehensive.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the insightful and constructive comments, which help clarify the presentation of our theoretical results and strengthen the empirical validation. We address each major comment point by point below and will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses

Referee: [Theoretical analysis (likely §3)] The lower bound on Newton-Raphson iterations is stated to depend on the direction of the warm-start error; however, the full derivation is not reproduced in the provided abstract, and the corollary linking the bound to the smallest singular value of the Jacobian requires explicit verification that the bound indeed becomes vacuous as σ_min → 0. This is load-bearing for motivating the RL approach over supervised learning.

Authors: We will reproduce the complete derivation of the lower bound (currently in the full manuscript but not excerpted in the abstract) in the revised §3, including all steps showing dependence on error direction rather than magnitude. For the corollary, we will add an explicit verification: as σ_min → 0 the iteration lower bound diverges to infinity (via the 1/σ_min term in the expression), rendering it vacuous near the saddle-node bifurcation. This will be presented with a short proof sketch to directly motivate why supervised regression fails in that regime while the RL direction-shaping approach remains effective. revision: yes
Referee: [Method and Experiments (likely §4-5)] The learned reward model is trained exclusively on perturbations around the base supervised model's predictions, yet the RL policy (GRPO) can generate warm starts outside this distribution. No validation is reported of the reward model's prediction accuracy on the actual policy outputs (e.g., correlation or error metrics between predicted and true iteration counts on policy samples). This mismatch risks optimizing a misaligned surrogate, undermining the claim that gains arise from shaping error direction as predicted by the bound.

Authors: We acknowledge the importance of verifying reward-model alignment on policy-generated samples. In the revision we will add a dedicated validation subsection reporting Pearson correlation and mean absolute error between the learned reward predictions and true Newton-Raphson iteration counts on warm-start vectors sampled from the trained GRPO policy (both during and after training). These metrics will be computed on held-out snapshots from the IEEE 118-bus and GOC benchmarks to confirm that the surrogate remains sufficiently accurate outside the original perturbation distribution. revision: yes
Referee: [Empirical results (likely Table 1 or §5)] The headline result that Newton's Lantern is the only method converging on every test snapshot with the smallest mean iteration count lacks reported error bars, standard deviations, or statistical significance tests across the multiple benchmarks. Without these, it is difficult to assess whether the observed superiority is robust or could be explained by training variance.

Authors: We will revise the experimental section and Table 1 to include error bars (standard deviation over 5 independent training seeds), per-benchmark standard deviations, and statistical significance tests (paired t-tests and Wilcoxon signed-rank tests with p-values) comparing Newton's Lantern against all baselines. These additions will demonstrate that the reported convergence on all snapshots and lowest mean iteration counts are robust to training stochasticity. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation and empirical claims rely on external observables

full rationale

The claimed lower bound on Newton-Raphson iteration count is presented as a mathematical result depending on warm-start error direction and Jacobian singular values, independent of the RL pipeline. The reward model is trained using actual observed iteration counts (an external, non-fitted quantity) as labels on perturbations of the base predictor; the policy then optimizes against this surrogate, with final performance measured directly on true convergence and iteration counts across held-out benchmarks. No equation or step reduces the performance claims to a self-referential fit, renaming, or self-citation chain; the method remains falsifiable against the true solver behavior.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The abstract relies on standard properties of the power-flow Jacobian and Newton-Raphson convergence theory (standard_math). No free parameters or new physical entities are explicitly introduced in the provided text; the RL components likely contain typical hyperparameters but are not detailed.

axioms (1)

standard math Newton-Raphson iteration count is a well-defined, observable function of the warm-start error vector and the power-flow Jacobian.
Invoked when stating the lower bound and when using iteration count as reward.

pith-pipeline@v0.9.0 · 5480 in / 1488 out tokens · 59591 ms · 2026-05-13T07:05:24.835975+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
Theorem 3.1 (Lower bound on NR iterations by direction)... Λ(v;D) is a discounted average of log∥QD(·)∥ along the orbit of v under the Newton-direction map
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear
We adopt group relative policy optimization... reward model Rϕ trained on perturbations of the base model's predictions, using the iteration count itself as the supervisory signal

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 2 internal anchors

[1]

The continuation power flow: A tool for steady state voltage stability analysis

Venkataramana Ajjarapu and Colin Christy. The continuation power flow: A tool for steady state voltage stability analysis. IEEE Transactions on Power Systems, 7 0 (1): 0 416--423, 1992

work page 1992
[2]

Proximal Policy Optimization with Graph Neural Networks for Optimal Power Flow,

Steven de Jongh, Frederik Mueller, Michael Suriyah, and Thomas Leibfried. Proximal policy optimization with graph neural networks for optimal power flow. In 12th International Conference on Data Science, Technology and Applications (DATA), 2023. arXiv:2212.12470

work page arXiv 2023
[3]

N ewton Methods for Nonlinear Problems: Affine Invariance and Adaptive Algorithms

Peter Deuflhard. N ewton Methods for Nonlinear Problems: Affine Invariance and Adaptive Algorithms . Springer, 2004

work page 2004
[4]

Warm-starting AC optimal power flow with graph neural networks

Florian Diehl. Warm-starting AC optimal power flow with graph neural networks. In NeurIPS Workshop on Tackling Climate Change with Machine Learning, 2019

work page 2019
[5]

Observations on the geometry of saddle node bifurcation and voltage collapse in electrical power systems

Ian Dobson. Observations on the geometry of saddle node bifurcation and voltage collapse in electrical power systems. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 39 0 (3): 0 240--243, 1992

work page 1992
[6]

Morison, and Prabha Kundur

Bei Gao, Graham K. Morison, and Prabha Kundur. Voltage stability evaluation using modal analysis. IEEE Transactions on Power Systems, 7 0 (4): 0 1529--1542, 1992

work page 1992
[7]

How to find all roots of complex polynomials by N ewton's method

John Hubbard, Dierk Schleicher, and Scott Sutherland. How to find all roots of complex polynomials by N ewton's method. Inventiones Mathematicae, 146: 0 1--33, 2001

work page 2001
[8]

A load flow calculation method for ill-conditioned power systems

Shinichi Iwamoto and Yasuo Tamura. A load flow calculation method for ill-conditioned power systems. IEEE Transactions on Power Apparatus and Systems, PAS-100 0 (4): 0 1736--1743, 1981

work page 1981
[9]

Quantum-enhanced reinforcement learning for accelerating N ewton- R aphson convergence with I sing machines: A case study for power flow analysis

Zeynab Kaseb et al. Quantum-enhanced reinforcement learning for accelerating N ewton- R aphson convergence with I sing machines: A case study for power flow analysis. arXiv preprint arXiv:2511.20237, 2025

work page arXiv 2025
[10]

C. T. Kelley. Solving Nonlinear Equations with N ewton's Method . SIAM, 2003

work page 2003
[11]

Review of machine learning techniques for optimal power flow

Hooman Khaloie, Mihaly Dolanyi, Jean-Fran c ois Toubeau, and Fran c ois Vall\'ee. Review of machine learning techniques for optimal power flow. Applied Energy, 388: 0 125637, 2025

work page 2025
[12]

Resilience analysis and cascading failure modeling of power systems under extreme temperatures

Seyyed Rashid Khazeiynasab and Junjian Qi. Resilience analysis and cascading failure modeling of power systems under extreme temperatures. Journal of Modern Power Systems and Clean Energy, 9 0 (6), 2021

work page 2021
[13]

Numerical polynomial homotopy continuation method to locate all the power flow solutions

Dhagash Mehta, Hung Dinh Nguyen, and Konstantin Turitsyn. Numerical polynomial homotopy continuation method to locate all the power flow solutions. IET Generation, Transmission & Distribution, 10 0 (12): 0 2972--2980, 2016

work page 2016
[14]

Okhuegbe, Adedasola A

Samuel N. Okhuegbe, Adedasola A. Ademola, and Yilu Liu. N ewton- R aphson AC power flow convergence based on deep learning initialization and homotopy continuation. IEEE Transactions on Industry Applications, 2024 a . doi:10.1109/TIA.2024.3514992

work page doi:10.1109/tia.2024.3514992 2024
[15]

Okhuegbe, Adedasola A

Samuel N. Okhuegbe, Adedasola A. Ademola, and Yilu Liu. A machine learning initializer for N ewton- R aphson AC power flow convergence. In 2024 IEEE Texas Power and Energy Conference (TPEC), pages 1--6, 2024 b

work page 2024
[16]

Ortega and Werner C

James M. Ortega and Werner C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. SIAM, 2000. Reprint of Academic Press, 1970

work page 2000
[17]

CANOS : A fast and scalable neural AC - OPF solver robust to N-1 perturbations

Luis Piloto, Sofia Liguori, Sephora Madjiheurem, Miha Zgubic, Sean Lovett, Hamish Tomlinson, Sophie Elster, Chris Apps, and Sims Witherspoon. CANOS : A fast and scalable neural AC - OPF solver robust to N-1 perturbations. arXiv preprint arXiv:2403.17660, 2024

work page arXiv 2024
[18]

Rivera, Anvita Bhagavathula, Alvaro Carbonero, and Priya Donti

Ana K. Rivera, Anvita Bhagavathula, Alvaro Carbonero, and Priya Donti. PF : A benchmark dataset for power flow under load, generation, and topology variations. In Advances in Neural Information Processing Systems (NeurIPS), 2025

work page 2025
[19]

Sauer and M

Peter W. Sauer and M. A. Pai. Power system steady-state stability and the load-flow J acobian. IEEE Transactions on Power Systems, 5 0 (4): 0 1374--1383, 1990

work page 1990
[20]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[21]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath : Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

G. W. Stewart and Ji-guang Sun. Matrix Perturbation Theory. Academic Press, 1990

work page 1990
[23]

Review of load-flow calculation methods

Brian Stott. Review of load-flow calculation methods. Proceedings of the IEEE, 62 0 (7): 0 916--929, 1974

work page 1974
[24]

Thorp and Sajid A

James S. Thorp and Sajid A. Naqavi. Load-flow fractals. In Proceedings of the 28th IEEE Conference on Decision and Control, pages 1822--1827, 1989

work page 1989
[25]

Tinney and Clifford E

William F. Tinney and Clifford E. Hart. Power flow solution by N ewton's method. IEEE Transactions on Power Apparatus and Systems, PAS-86 0 (11): 0 1449--1460, 1967

work page 1967
[26]

Tiranuchit and Robert J

A. Tiranuchit and Robert J. Thomas. A posturing strategy against voltage instabilities in electric power systems. IEEE Transactions on Power Systems, 3 0 (1): 0 87--93, 1988

work page 1988
[27]

Data driven approach towards more efficient N ewton- R aphson power flow calculation for distribution grids

Shengyuan Yan et al. Data driven approach towards more efficient N ewton- R aphson power flow calculation for distribution grids. arXiv preprint arXiv:2504.11650, 2025

work page arXiv 2025