arxiv: 2604.26150 · v1 · submitted 2026-04-28 · 🧮 math.OC


Reinforcement Learning for Public Safety Power Shutoffs Under Decision-Dependent Uncertainty and Nonlinear Wildfire Ignition Models

Prasanna Raut, Chaoyue Zhao, Alexandre Moreira


Pith reviewed 2026-05-07 14:54 UTC · model grok-4.3

classification 🧮 math.OC

keywords: reinforcement learning · public safety power shutoffs · wildfire ignition models · distribution system topology · decision-dependent uncertainty · proximal policy optimization · power grid operations


The pith

Reinforcement learning optimizes public safety power shutoffs by training directly on flexible wildfire ignition simulators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how a reinforcement learning agent using Proximal Policy Optimization can learn to de-energize parts of a power distribution network to prevent wildfires. It does this by interacting repeatedly with a simulator that accepts any line failure probability model, including nonlinear ones that depend on the shutoff decisions themselves. Traditional mixed-integer programming methods require simplified assumptions about those probabilities to stay solvable, which can make the plans less accurate in reality. If the learned policies transfer well, they would let utilities reduce the costs imposed on communities by shutoffs while still addressing ignition risks more realistically. Tests on 54-bus and 138-bus networks indicate lower total operational costs than existing methods, with computation time growing only modestly as the network enlarges.

Core claim

The central claim is that a Proximal Policy Optimization reinforcement learning framework can learn distribution system topology adjustments by direct interaction with a simulator of decision-dependent wildfire ignition, without the restrictive structural assumptions required by mixed-integer programming formulations, and that this yields lower operational costs on 54-bus and 138-bus test systems while scaling computationally.

What carries the argument

A Proximal Policy Optimization agent that repeatedly selects topology changes and receives rewards based on simulated wildfire ignition outcomes and community costs, trained against any chosen nonlinear line-failure probability model.
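The interaction described above can be sketched with a toy environment. This is a minimal illustration under stated assumptions, not the paper's simulator: the series-feeder layout, the names `ToyFeederEnv`, `linear_model`, and `logistic_model`, and every cost and probability constant are hypothetical. It shows a failure model passed in as an arbitrary callable, and an ignition probability that depends on the shutoff decision through the flow each energized line carries.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: line loading in [0, 1] -> ignition probability.
FailureModel = Callable[[float], float]


def linear_model(loading: float) -> float:
    """Illustrative linear model (assumed constants)."""
    return 0.10 * loading


def logistic_model(loading: float) -> float:
    """Illustrative nonlinear model: risk rises sharply near 70% loading."""
    return 0.10 / (1.0 + math.exp(-12.0 * (loading - 0.7)))


class ToyFeederEnv:
    """Lines in series; line i feeds load i and every load downstream of it."""

    def __init__(self, demands: Sequence[float], capacity: float,
                 failure_model: FailureModel, shed_cost: float = 1.0,
                 fire_cost: float = 50.0, seed: int = 0) -> None:
        self.demands = list(demands)
        self.capacity = capacity
        self.failure_model = failure_model
        self.shed_cost = shed_cost
        self.fire_cost = fire_cost
        self.rng = random.Random(seed)

    def step(self, cut: int) -> Tuple[float, List[int]]:
        """Open the feeder at line `cut`; cut == len(demands) keeps all lines closed."""
        served = self.demands[:cut]
        shed = sum(self.demands[cut:])  # load behind the open switch is lost
        ignitions: List[int] = []
        for i in range(len(served)):
            # Shedding downstream load lowers upstream flow, so the ignition
            # probability of each energized line depends on the decision.
            flow = sum(served[i:])
            loading = min(flow / self.capacity, 1.0)
            p = self.failure_model(loading)
            ignitions.append(int(self.rng.random() < p))
        cost = self.shed_cost * shed + self.fire_cost * sum(ignitions)
        return -cost, ignitions  # reward is the negative total cost


# Usage: the same environment accepts either model without modification.
env = ToyFeederEnv([1.0, 2.0, 3.0], capacity=6.0, failure_model=logistic_model)
reward, ignitions = env.step(cut=2)
print(reward, ignitions)
```

Because the environment only ever calls `failure_model(loading)`, swapping the linear model for the logistic one requires no change to the training loop, which is the property the review attributes to the simulator-based approach.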

If this is right

Utilities could adopt more realistic nonlinear ignition models when planning shutoffs instead of being limited to tractable simplifications.
Operational costs from de-energizing lines can be lowered while still controlling ignition risk on the tested network sizes.
Computation time remains manageable as network size increases, supporting use on larger real-world distribution systems.
Policies can be trained offline in simulation before any live deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Better wildfire simulators could become a shared resource for training such agents across utilities.
The same reinforcement learning setup might extend to other grid decisions where failures depend on prior actions, such as maintenance scheduling.
Real deployment would still need separate validation that simulator-trained policies remain safe under unmodeled conditions like weather extremes.

Load-bearing premise

The simulator must faithfully reproduce how real-world decisions change the probabilities of power lines igniting wildfires.

What would settle it

Deploying the learned policies on an actual distribution system and observing either substantially higher wildfire ignition rates or higher total costs than the simulator predicted would disprove the practical value of the approach.

Figures

Figures reproduced from arXiv: 2604.26150 by Alexandre Moreira, Chaoyue Zhao, Prasanna Raut.

**Figure 1.** Comparison of line failure probability models as …

**Figure 2.** Single-line diagram of the 54-bus distribution system

**Figure 3.** Distribution of power flow magnitudes (as a percentage of line capacity) for wildfire-area lines under the three policies

**Figure 4.** Single-line diagram of the 138-bus distribution system. The wildfire-affected, high-threat area is highlighted in red. …

**Figure 5.** Power flow distribution for wildfire-area lines on the …

Original abstract

Power grid infrastructure is an increasingly significant source of wildfire ignitions and poses severe risks to communities in fire-prone regions. Public Safety Power Shutoffs (PSPS) have emerged as a critical operational tool for utilities to mitigate this risk by proactively de-energizing portions of the grid under high-threat conditions. These shutoffs, however, impose costs on affected communities, and it is therefore essential that PSPS decisions be informed by realistic models of wildfire ignition risk. Current Mixed Integer Programming based methods require restrictive structural assumptions about the probability models for line failures caused by power line ignitions. While these simplifications yield tractable solutions, the resulting models may differ significantly from the true underlying dynamics. In this paper, we propose a reinforcement learning framework based on Proximal Policy Optimization that learns to adjust the topology of a distribution system by interacting directly with a simulator that accommodates any line failure probability model without imposing such restrictions. We test our methodology on 54-bus and 138-bus distribution systems and demonstrate its ability to lower operational costs compared to existing methods while allowing only marginally increased compute times as network size grows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PPO with a flexible simulator lets them drop MIP restrictions on wildfire models for PSPS, and they report lower costs on 54- and 138-bus cases, but the experiments give almost no evidence the simulator matches reality or that the gains survive model error.


The paper's main move is to train a PPO policy that picks which lines to de-energize by querying an arbitrary simulator for ignition probabilities instead of forcing the failure model into a form MIP solvers can handle. That removes a real modeling constraint that earlier PSPS work had to live with, and the abstract claims the resulting policies cut operational costs on the two test networks while compute stays reasonable as size grows.

The approach is straightforward: the agent sees the current topology and threat level, acts by opening switches, and gets reward based on the simulator's ignition and load-shed outcomes. That setup is new enough in this specific combination to be worth noting for people who already use simulation in grid operations. What they do cleanly is keep the method general; nothing in the algorithm itself assumes linearity or convexity in the ignition function. They also show the policy can be trained once and then used quickly, which matters for operational time scales.

The soft spots sit in the evidence. The abstract mentions cost reductions but supplies no baselines beyond the MIP methods that only work under the restrictive models, no error bars, no sensitivity checks on simulator parameters, and no comparison against historical ignition records. The stress-test point holds: if the simulator's decision-dependent probabilities are off, the learned policy has no demonstrated robustness, and nothing in the reported results quantifies how much advantage remains once you move to truly nonlinear cases that MIP cannot solve. The central claim therefore rests on untested transfer from simulation to field.

This is for researchers in power-system optimization and wildfire-risk modeling who are already comfortable with RL or black-box simulators. It is not yet ready for operators to adopt, but the framing is honest about the modeling gap it tries to close.

I would send it to peer review; the idea is clear, the problem is live, and a referee can push on the missing validation without the paper collapsing.

Referee Report

3 major / 1 minor

Summary. The paper proposes a Proximal Policy Optimization (PPO) reinforcement learning framework for Public Safety Power Shutoffs (PSPS) that interacts directly with a black-box simulator to optimize distribution system topology under arbitrary (including nonlinear) decision-dependent line failure probability models for wildfire ignition. Unlike Mixed Integer Programming (MIP) approaches that require restrictive structural assumptions on the probability models, the RL method is tested on 54-bus and 138-bus systems and claims lower operational costs with only marginal growth in compute time as network size increases.

Significance. If the experimental claims hold under rigorous validation, the work would be significant for enabling PSPS decisions with more realistic nonlinear ignition models that MIP methods cannot tractably handle. The core strength is the simulator-interaction approach that avoids self-referential parameter restrictions and allows general failure models. However, the current lack of experimental details, real-data calibration, and robustness checks substantially reduces the assessed significance.

major comments (3)

[Abstract] The central performance claims of cost reductions on the 54-bus and 138-bus systems are presented without any description of the experimental setup, baselines employed, number of runs, error bars, or statistical significance testing; this information is load-bearing for evaluating whether the RL policy actually outperforms MIP under the claimed general (nonlinear) models.

[Results] Comparisons to existing MIP methods are reported, but it is not specified whether these baselines were run under the same nonlinear wildfire ignition models or only under the restrictive (linear/convex) models solvable by MIP; without this distinction the superiority claim under arbitrary models cannot be assessed.

[Simulator and experimental design] The framework's ability to accommodate any failure model rests on direct simulator interaction, yet no calibration to historical ignition data, sensitivity analysis to model misspecification, or out-of-distribution generalization tests for the learned PPO policy are provided; these omissions directly undermine the practical transferability asserted in the abstract.

minor comments (1)

[Abstract] The phrase 'marginally increased compute times as network size grows' is stated without quantitative scaling data or explicit comparison tables, reducing clarity on the computational advantage.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify important opportunities to strengthen the presentation of experimental results and clarify the scope of our claims. We address each major comment below, indicating where revisions will be made to the manuscript.


Referee: [Abstract] The central performance claims of cost reductions on the 54-bus and 138-bus systems are presented without any description of the experimental setup, baselines employed, number of runs, error bars, or statistical significance testing; this information is load-bearing for evaluating whether the RL policy actually outperforms MIP under the claimed general (nonlinear) models.

Authors: We agree that the abstract would benefit from additional context on the experimental protocol. In the revised version we will expand the abstract to note that MIP baselines are applied only under model assumptions they can accommodate, that results are averaged over multiple random seeds with reported variability, and that consistent cost reductions are observed. Full details on the 54-bus and 138-bus test cases, number of runs, and statistical procedures remain in Section 4.

Revision: yes

Referee: [Results] Comparisons to existing MIP methods are reported, but it is not specified whether these baselines were run under the same nonlinear wildfire ignition models or only under the restrictive (linear/convex) models solvable by MIP; without this distinction the superiority claim under arbitrary models cannot be assessed.

Authors: The referee correctly notes the distinction. MIP formulations are tractable only for linear or convex ignition probability models; all reported MIP comparisons therefore use instances satisfying those assumptions. For general nonlinear models, MIP is intractable by construction, which is precisely the setting our simulator-based RL approach targets. We will revise the Results section to state this separation explicitly, add a clarifying paragraph, and include a supplementary table that reports RL performance on both linear and nonlinear instances while noting MIP intractability on the latter.

Revision: yes

Referee: [Simulator and experimental design] The framework's ability to accommodate any failure model rests on direct simulator interaction, yet no calibration to historical ignition data, sensitivity analysis to model misspecification, or out-of-distribution generalization tests for the learned PPO policy are provided; these omissions directly undermine the practical transferability asserted in the abstract.

Authors: We acknowledge that the current experiments employ synthetic nonlinear ignition models chosen to demonstrate generality rather than real-data calibration. In revision we will add a sensitivity analysis varying ignition probability parameters and will include a discussion of how the black-box simulator can be calibrated with historical records (e.g., from CAL FIRE). Comprehensive out-of-distribution tests on held-out real ignition events would require additional curated datasets beyond the scope of this methodological study and are noted as future work; we will make this limitation explicit.

Revision: partial
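A sensitivity analysis of the kind promised in the last response could take roughly the following shape. This is a hedged sketch, not the authors' procedure: the logistic form of `ignition_prob`, its `steepness`, `midpoint`, and `p_max` parameters, the fixed line loadings standing in for a trained policy's outcome, and the cost constant are all hypothetical. It sweeps one model parameter and reports the mean cost with its standard error across random seeds, which is the variability the referee asks for.

```python
import math
import random
import statistics
from typing import Dict, Sequence, Tuple


def ignition_prob(loading: float, steepness: float,
                  midpoint: float = 0.7, p_max: float = 0.10) -> float:
    """Hypothetical logistic ignition model; all parameters are assumptions."""
    return p_max / (1.0 + math.exp(-steepness * (loading - midpoint)))


def simulate_cost(loadings: Sequence[float], steepness: float, seed: int,
                  fire_cost: float = 50.0, n_episodes: int = 200) -> float:
    """Average ignition cost per episode for fixed line loadings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_episodes):
        for loading in loadings:
            if rng.random() < ignition_prob(loading, steepness):
                total += fire_cost
    return total / n_episodes


def sweep(loadings: Sequence[float], steepness_values: Sequence[float],
          seeds: Sequence[int]) -> Dict[float, Tuple[float, float]]:
    """Map each steepness value to (mean cost, standard error over seeds)."""
    results: Dict[float, Tuple[float, float]] = {}
    for s in steepness_values:
        costs = [simulate_cost(loadings, s, seed) for seed in seeds]
        mean = statistics.mean(costs)
        sem = statistics.stdev(costs) / math.sqrt(len(costs))
        results[s] = (mean, sem)
    return results


# Usage: loadings are the per-line loadings a fixed policy leaves energized.
for steepness, (mean, sem) in sweep([0.5, 0.9], [4.0, 12.0, 24.0],
                                    seeds=range(5)).items():
    print(f"steepness={steepness}: cost={mean:.2f} +/- {sem:.2f}")
```

In a fuller study the fixed loadings would be replaced by rollouts of the trained policy in a simulator whose parameters differ from those used in training, so that the sweep measures robustness to model misspecification.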

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard RL interaction with external simulator


The paper's core claim is a PPO-based RL framework that learns topology adjustments via direct interaction with a black-box simulator accommodating arbitrary (including nonlinear) failure probability models. This structure is self-contained: the policy is trained against an independent simulator whose dynamics are not defined by the method itself, and performance is evaluated empirically on 54-bus and 138-bus test cases against MIP baselines. No equations or steps reduce by construction to fitted parameters, self-citations, or ansatzes imported from the authors' prior work. The derivation chain (state-action formulation, PPO updates, simulator rollouts) follows standard RL practice without renaming known results or smuggling assumptions via self-reference. Minor self-citation risk is absent from the provided text.
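For readers unfamiliar with the "PPO updates" step in that chain, the clipped surrogate objective from the original PPO paper (Schulman et al., 2017) can be written in a few lines. This is a minimal pure-Python sketch of that standard objective only; the function name `ppo_clip_objective` and the default `clip_eps=0.2` are illustrative choices, and the hyperparameters, value loss, and entropy terms used in the reviewed paper are not stated in the text above.

```python
import math
from typing import Sequence


def ppo_clip_objective(logp_new: Sequence[float], logp_old: Sequence[float],
                       advantages: Sequence[float],
                       clip_eps: float = 0.2) -> float:
    """Mean clipped surrogate objective (to be maximized) over sampled steps."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio between the updated policy and the rollout policy.
        ratio = math.exp(ln - lo)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Usage: log-probabilities and advantages would come from simulator rollouts.
value = ppo_clip_objective(
    logp_new=[-0.9, -1.2, -0.3],
    logp_old=[-1.0, -1.0, -0.5],
    advantages=[1.5, -0.7, 0.2],
)
print(value)
```

The clipping removes the incentive to move the policy far from the one that generated the rollouts in a single update, which matters when each rollout requires a call to a comparatively expensive simulator.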

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the existence and fidelity of a simulator that can handle arbitrary line failure probability models; no explicit free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: A simulator exists that can accurately model any line failure probability without structural restrictions.
The RL agent learns by direct interaction with this simulator.


