Power Systems Agent Benchmark: Executable Evaluation of AI Agents in Electric Power Engineering
Pith reviewed 2026-06-26 16:56 UTC · model grok-4.3
The pith
The paper introduces an executable benchmark where AI agents receive structured power engineering tasks and return solutions that are checked by deterministic code for feasibility and violations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper's central claim is that an executable benchmark of 41 task families with deterministic evaluators can assess power-engineering agents by validating their structured outputs against engineering constraints. Each evaluation returns an explicit feasibility flag, a score, and a list of violations, and the evaluators can later be upgraded to simulator-backed checks without altering the task interface.
What carries the argument
The Power Systems Agent Benchmark, which pairs structured tasks with deterministic evaluators that recompute quantities and check constraints to produce feasibility flags, scores, and violations.
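To make the task/evaluator pairing concrete, here is a minimal sketch of what such an evaluator could look like for a toy generator-dispatch task. Everything in it is an assumption made for illustration: the `EvalResult` type, the `evaluate_dispatch` function, the task fields (`generators`, `load_mw`, `tolerance_mw`, `reference_cost`) and the cost-ratio scoring rule are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass, field


@dataclass
class EvalResult:
    feasible: bool
    score: float  # normalized to [0, 1]
    violations: list = field(default_factory=list)


def evaluate_dispatch(task: dict, solution: dict) -> EvalResult:
    """Hypothetical evaluator for a toy dispatch task (not the paper's code)."""
    violations = []
    dispatch = solution.get("dispatch_mw", {})

    # Check each generator's output against the limits in the task definition.
    for name, gen in task["generators"].items():
        if name not in dispatch:
            violations.append(f"{name}: no dispatch value returned")
            continue
        p = dispatch[name]
        if not gen["p_min_mw"] <= p <= gen["p_max_mw"]:
            violations.append(
                f"{name}: {p} MW outside [{gen['p_min_mw']}, {gen['p_max_mw']}] MW"
            )

    # Recompute the power balance from the returned numbers.
    total = sum(dispatch.get(name, 0.0) for name in task["generators"])
    if abs(total - task["load_mw"]) > task["tolerance_mw"]:
        violations.append(
            f"power balance: {total} MW supplied vs {task['load_mw']} MW load"
        )

    if violations:
        return EvalResult(False, 0.0, violations)

    # Recompute the cost instead of trusting any cost the agent reports.
    cost = sum(
        gen["cost_per_mwh"] * dispatch[name]
        for name, gen in task["generators"].items()
    )
    score = min(1.0, task["reference_cost"] / cost) if cost > 0 else 0.0
    return EvalResult(True, score, [])
```

The point of the sketch is the shape of the contract: the evaluator never reads the agent's explanation, only its structured numbers, and it derives feasibility, score, and violations from quantities it recomputes itself.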
If this is right
- Agents receive concrete scores based on whether their solutions satisfy engineering constraints rather than on the quality of their explanations.
- The same task format can later support evaluator upgrades to full simulators without changing how agents are instructed or how solutions are submitted.
- Unanimous failures across multiple agents can flag defects in individual tasks or evaluators for correction.
- Held-out instances generated from private seeds allow measurement of generalization separate from public-split performance.
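The held-out mechanism in the last bullet can be sketched as a seeded per-family generator. This is one plausible construction under stated assumptions, not the paper's implementation: the helper names `derive_seed` and `generate_case`, the hash-based seed derivation, and the case fields are all hypothetical.

```python
import hashlib
import random


def derive_seed(private_seed: str, family: str, index: int) -> int:
    """Derive a per-case integer seed from a private string (hypothetical scheme)."""
    digest = hashlib.sha256(f"{private_seed}:{family}:{index}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def generate_case(private_seed: str, family: str, index: int) -> dict:
    """Synthesize one illustrative case; the fields are placeholders."""
    rng = random.Random(derive_seed(private_seed, family, index))
    return {
        "family": family,
        "case_id": f"{family}-{index:04d}",
        "load_mw": round(rng.uniform(50.0, 200.0), 1),
        "line_limit_mw": round(rng.uniform(80.0, 250.0), 1),
    }
```

With a scheme of this kind the generator code can be published for inspection while the instances stay private, because the same code produces different cases for anyone who does not hold the seed, and identical cases for anyone who does.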
Where Pith is reading between the lines
- The approach could be adapted to create similar executable benchmarks in other engineering domains that rely on quantitative checks.
- If the benchmark correlates with real-world performance, it could guide development of agents that integrate with domain-specific engineering software.
- Consistency between public and held-out results in the reference evaluation suggests the generation method resists contamination while remaining reproducible.
Load-bearing premise
The 41 task families and their deterministic evaluators are representative enough of real power engineering problems to act as valid proxies for feasibility.
What would settle it
An experiment in which agents that score highly on the benchmark perform poorly when applied to actual power system operations or when the same tasks are solved using full power-system simulators.
Original abstract
Executable evaluation -- checking the consequences of an agent's actions with a program rather than grading its prose -- has become a prominent way to assess tool-using AI agents in software settings. Electric power engineering has not yet had an analogous benchmark: language-model use is still dominated by retrieval and text question answering, while agents acting on power-system artifacts remain mostly academic prototypes. We introduce the Power Systems Agent Benchmark, an executable benchmark for power-engineering agents. An agent receives a structured task and returns a structured solution; a deterministic evaluator recomputes the engineering quantities, checks operational constraints, and returns a feasibility flag, a normalized score, and explicit violations. The benchmark contains 41 task families across eight areas of power engineering, from power flow and protection to stability, microgrids, reliability, power quality, and forecasting. Each task is grounded in a citable source, standard, or documented engineering formulation. To resist contamination, held-out cases are synthesized on demand by per-family generators from private seeds: the construction is inspectable, but the instances remain private. In a reference evaluation with three command-line agents, the strongest score near the compact tier's ceiling, a smaller open model trails, and public and held-out performance are broadly consistent; a separate public-split grid with OpenCode and Aider probes harness effects. The reference evaluation doubles as quality control: unanimous failures flag candidate task or evaluator defects, and it exposed a latent evaluator bug missed by self-consistency checks. The evaluators are compact deterministic surrogates, but the task contract allows their internals to be upgraded to simulator-backed checks without changing how tasks are posed or solved.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Power Systems Agent Benchmark, an executable benchmark for AI agents in electric power engineering. Agents receive structured tasks across 41 families in eight areas (power flow, protection, stability, etc.) and return structured solutions; deterministic evaluators recompute quantities, check constraints, and output feasibility flags, normalized scores, and violations. Tasks are grounded in citable sources/standards, with on-demand synthesis from private seeds for held-out instances to resist contamination. A reference evaluation with three command-line agents is reported, along with quality control via unanimous agent failures that exposed an evaluator bug; evaluators are described as compact deterministic surrogates that can later be upgraded to simulator-backed checks without altering the task interface.
Significance. If the benchmark's evaluators prove reliable, this work supplies a much-needed executable evaluation framework for a domain where AI use remains largely limited to retrieval and text QA. Credit is due for the contamination-resistant design (on-demand synthesis from private seeds), the explicit grounding in external standards, the quality-control mechanism that detected a latent bug, and the forward-compatible task contract that permits evaluator upgrades. These features position the artifact as a reusable, inspectable starting point rather than a one-off leaderboard.
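The quality-control mechanism credited above reduces to a simple check over the results table. The following is a minimal sketch, assuming results are stored as a nested mapping from agent name to case id to feasibility flag; the function name `flag_unanimous_failures` and that layout are hypothetical, not the paper's data format.

```python
def flag_unanimous_failures(results: dict) -> list:
    """Return case ids that every agent attempted and every agent failed.

    `results[agent][case_id]` is True when that agent's solution was feasible.
    """
    agents = list(results)
    if not agents:
        return []
    # Only consider cases that all agents actually ran.
    shared = set.intersection(*(set(results[agent]) for agent in agents))
    return sorted(
        case for case in shared
        if not any(results[agent][case] for agent in agents)
    )
```

A flagged case is only a candidate defect: it may be a broken task or evaluator, or simply a hard problem, so each one still needs manual inspection.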
major comments (1)
- [Abstract and Reference Evaluation description] The central claim that the benchmark supplies valid executable evaluation rests on the 41 task families' deterministic evaluators correctly recomputing quantities and enforcing constraints. However, the manuscript reports no direct comparison, error-bound analysis, or validation of these compact surrogates against established full simulators (e.g., pandapower or PSSE). Quality control via unanimous failures only flags gross bugs, not systematic approximation errors in the engineering formulations. This leaves the proxy-validity assumption untested and directly affects whether benchmark outputs constitute reliable engineering assessment.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the benchmark's design features, including contamination resistance, grounding in standards, and the quality-control mechanism. We address the single major comment below.
- Referee: [Abstract and Reference Evaluation description] The central claim that the benchmark supplies valid executable evaluation rests on the 41 task families' deterministic evaluators correctly recomputing quantities and enforcing constraints. However, the manuscript reports no direct comparison, error-bound analysis, or validation of these compact surrogates against established full simulators (e.g., pandapower or PSSE). Quality control via unanimous failures only flags gross bugs, not systematic approximation errors in the engineering formulations. This leaves the proxy-validity assumption untested and directly affects whether benchmark outputs constitute reliable engineering assessment.
  Authors: We agree that the manuscript provides no direct numerical comparison or error-bound analysis of the compact evaluators against full simulators such as pandapower or PSSE. Each evaluator implements a deterministic version of the engineering calculation drawn from the citable source or standard listed for its task family; the formulations are therefore transparent and inspectable rather than black-box approximations. Nevertheless, the absence of an empirical validation study against established simulators leaves the magnitude of any systematic discrepancy unquantified, which is a genuine limitation for claims of proxy validity. In the revised manuscript we will add an explicit Limitations subsection that states this gap, reiterates the forward-compatible task contract that permits future replacement by simulator-backed evaluators, and outlines a concrete plan for such validation on a representative subset of task families.
  revision: yes
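The kind of error-bound analysis under discussion can be illustrated on a single textbook quantity. The sketch below compares a linearized (DC) approximation of active power transfer over a lossless line, P = δ / X, with the sinusoidal expression P = V1 · V2 · sin(δ) / X. Both functions are stand-ins chosen for illustration: the "reference" here is itself a closed-form formula, not pandapower or any full simulator, and neither function is taken from the benchmark.

```python
import math


def reference_flow_pu(v1: float, v2: float, delta_rad: float, x_pu: float) -> float:
    """Active power over a lossless line (stand-in for a higher-fidelity check)."""
    return v1 * v2 * math.sin(delta_rad) / x_pu


def surrogate_flow_pu(delta_rad: float, x_pu: float) -> float:
    """Linearized DC approximation with unit voltages (stand-in for a compact surrogate)."""
    return delta_rad / x_pu


def relative_errors(angles_deg, x_pu: float = 0.1) -> list:
    """Relative error of the surrogate at each non-zero angle difference."""
    errors = []
    for deg in angles_deg:
        delta = math.radians(deg)
        ref = reference_flow_pu(1.0, 1.0, delta, x_pu)
        sur = surrogate_flow_pu(delta, x_pu)
        errors.append(abs(sur - ref) / abs(ref))
    return errors


errs = relative_errors([5, 10, 20, 30])
print(f"max relative error: {max(errs):.4f}")
```

A validation study along the lines the authors outline would report this sort of worst-case and typical discrepancy per task family, over the operating range the generators actually sample.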
Circularity Check
No circularity: benchmark is a new artifact grounded in external sources
The paper introduces the Power Systems Agent Benchmark as a new executable evaluation artifact. Tasks are explicitly grounded in citable external standards or documented engineering formulations, with held-out cases generated on demand from private seeds. Deterministic evaluators are described as compact surrogates that can be upgraded without changing task contracts. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The central construction does not reduce to its own inputs by definition or by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Each task is grounded in a citable source, standard, or documented engineering formulation.
Reference graph
Works this paper leans on
- [1] Muhammad Sarwar, Muhammad Rizwan, Mubushra Aziz, and Abdul Rehman Sudais. Large language models for power system applications: A comprehensive literature survey. arXiv preprint arXiv:2512.13004, 2025.
- [2] Soham Ghosh and Gaurav Mittal. Agentic AI systems in electrical power systems engineering: Current state-of-the-art and challenges. arXiv preprint arXiv:2511.14478, 2025.
- [3] Hongwei Jin, Kibaek Kim, and Jonghwan Kwon. Gridmind: LLMs-powered agents for power system analysis and operations. arXiv preprint arXiv:2509.02494, 2025.
- [4] Yihan Wen and Xin Chen. X-gridagent: An LLM-powered agentic AI system for assisting power grid analysis. arXiv preprint arXiv:2512.20789, 2025.
- [5] Lianmin Zheng et al. Judging LLM-as-a-judge with MT-bench and chatbot arena. In NeurIPS Datasets and Benchmarks, 2023. arXiv:2306.05685.
- [6] Peiyi Wang et al. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926, 2023.
- [7] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In ICLR, 2024. arXiv:2310.06770.
- [8] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024.
- [9] Laude Institute and Stanford University. Terminal-bench: Benchmarking AI agents on realistic terminal tasks. https://www.tbench.ai/, 2025.
- [10] Mark Chen et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- [11] Jun Shern Chan et al. MLE-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024.
- [12] Xiao Liu et al. Agentbench: Evaluating LLMs as agents. In ICLR, 2024. arXiv:2308.03688.
- [13] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. arXiv preprint arXiv:2311.12983, 2023.
- [14] Shuyan Zhou et al. Webarena: A realistic web environment for building autonomous agents. In ICLR, 2024. arXiv:2307.13854.
- [15] Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, and Jürgen Schmidhuber. Agent-as-a-judge: Evaluate agents with agents. arXiv preprint arXiv:2410.10934, 2024.
- [16] Fangyi Yu. When AIs judge AIs: The rise of agent-as-a-judge evaluation for LLMs. arXiv preprint arXiv:2508.02994, 2025.
- [17] PowerAgent community, Harvard SEAS. PowerAgentBench: Standardized tasks, environments, and metrics for power-system agents. GitHub repository, Power-Agent/PowerAgentBench, 2026. URL https://github.com/Power-Agent/PowerAgentBench. Benchmark component of the PowerAgent ecosystem (poweragent.seas.harvard.edu); no dedicated publication at the time of writing.
- [18] Qian Zhang and Le Xie. Poweragent: A road map toward agentic intelligence in power systems: Foundation model, model context protocol, and workflow. IEEE Power & Energy Magazine, 23(5):93–101, 2025.
- [19] Xiyuan Zhou et al. Elecbench: a power dispatch evaluation benchmark for large language models. In IEEE PES General Meeting, 2025. arXiv:2407.05365; Best Paper.
- [20] PFBench: Power-flow benchmark for LLM-based power system agent evaluation. IEEE DataPort, 2025. URL https://ieee-dataport.org/documents/power-flow-benchmark-llm-based-power-system-agent-evaluation-pfbench
- [21] Mohamed Shamseldein. Grid-mind: An LLM-orchestrated multi-fidelity agent for automated connection impact assessment. arXiv preprint arXiv:2602.20683, 2026.
- [22] Buxin She, Brian Chen, Luanzheng Guo, and Fangxing Li. PFAgent: A tractable and self-evolving power-flow agent for interactive grid analysis. arXiv preprint arXiv:2604.10846, 2026.
- [23] Antoine Marot, Benjamin Donnot, Camilo Romero, Balthazar Donon, Marvin Lerousseau, Luca Veyrin-Forrer, and Isabelle Guyon. Learning to run a power network challenge for training topology controllers. Electric Power Systems Research, 189, 2020. arXiv:1912.04211.
- [24] Antoine Marot, Benjamin Donnot, Gabriel Dulac-Arnold, Adrian Kelly, Aïdan O’Sullivan, Jan Viebahn, Mariette Awad, Isabelle Guyon, Patrick Panciatici, and Camilo Romero. Learning to run a power network challenge: a retrospective analysis. In NeurIPS 2020 Competition and Demonstration Track, PMLR v133, pages 112–132, 2021.
- [25] Leon Thurner, Alexander Scheidler, Florian Schäfer, Jan-Hendrik Menke, Julian Dollichon, Friederike Meier, Steffen Meinecke, and Martin Braun. pandapower - an open-source Python tool for convenient modeling, analysis, and optimization of electric power systems. IEEE Transactions on Power Systems, 33(6):6510–6521, 2018.
- [26] Ray D. Zimmerman, Carlos E. Murillo-Sánchez, and Robert J. Thomas. MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education. IEEE Transactions on Power Systems, 26(1):12–19, 2011.
- [27] Roger C. Dugan and Thomas E. McDermott. An open source platform for collaborating on smart grid research. In IEEE PES General Meeting, 2011.
- [28] Carleton Coffrin, Russell Bent, Kaarthik Sundar, Yeesian Ng, and Miles Lubin. Powermodels.jl: An open-source framework for exploring power flow formulations. In Power Systems Computation Conference (PSCC), 2018.
- [29] Sogol Babaeinejadsarookolaee et al. The power grid library for benchmarking AC optimal power flow algorithms. arXiv preprint arXiv:1908.02788, 2019.
- [30] Hantao Cui, Fangxing Li, and Kevin Tomsovic. Hybrid symbolic-numeric framework for power system modeling and analysis. IEEE Transactions on Power Systems, 36(2):1373–1384, 2021.
- [31] Tom Brown, Jonas Hörsch, and David Schlachtberger. PyPSA: Python for power system analysis. Journal of Open Research Software, 6(1), 2018.
- [32] Yuxing Cheng, Yi Chang, and Yuan Wu. A survey on data contamination for large language models. arXiv preprint arXiv:2502.14425, 2025.
- [33] Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, and Baishakhi Ray. Recent advances in large language model benchmarks against data contamination: From static to dynamic evaluation. arXiv preprint arXiv:2502.17521, 2025.
- [34] Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. AI agents that matter. arXiv preprint arXiv:2407.01502, 2024.
- [35] Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024.
- [36] Yiting Zhao, Chen Yuan, Sun Li, Guangyi Liu, Renchang Dai, and Zhiwei Wang. Graph computing based fast screening in contingency analysis. arXiv preprint arXiv:1904.00044, 2019.
- [37] IEEE Std 738-2012 standard for calculating the current-temperature relationship of bare overhead conductors, 2012.
- [38] IEC 60909-0:2016 short-circuit currents in three-phase a.c. systems - part 0: Calculation of currents, 2016.
- [39] Paul M. Anderson. Analysis of Faulted Power Systems. Wiley-IEEE Press, 1995. ISBN 978-0-7803-1145-9.
- [40] Stanley H. Horowitz and Arun G. Phadke. Power System Relaying. Wiley, 4th edition, 2014. ISBN 978-1-118-66200-7.
- [41] IEC 60364-5-52 low-voltage electrical installations - selection and erection of electrical equipment - wiring systems, 2009.
- [42] Prabha Kundur. Power System Stability and Control. McGraw-Hill, 1994.
- [43] IEEE Std 2800-2022 standard for interconnection and interoperability of inverter-based resources interconnecting with associated transmission electric power systems, 2022.
- [44] ENTSO-E network code on requirements for grid connection of generators (RfG), 2016.
- [45] Konstantin Turitsyn, Petr Šulc, Scott Backhaus, and Michael Chertkov. Local control of reactive power by distributed photovoltaic generators. arXiv preprint arXiv:1006.0160, 2010.
- [47] Shiva Poudel and Anamika Dubey. A two-stage service restoration method for electric power distribution systems. arXiv preprint arXiv:2004.07921, 2020.
- [48] IEEE Std 1366-2012 guide for electric power distribution reliability indices, 2012.
- [49] EN 50160:2010 voltage characteristics of electricity supplied by public distribution networks, 2010.
- [50] IEEE Std 519-2014 recommended practice and requirements for harmonic control in electric power systems, 2014.
- [51] IEC 60076-7:2018 power transformers - part 7: Loading guide for mineral-oil-immersed power transformers, 2018.
- [52] Ali Abur and Antonio Gómez Expósito. Power System State Estimation: Theory and Implementation. Marcel Dekker, 2004.
- [53] Yao Liu, Peng Ning, and Michael K. Reiter. False data injection attacks against state estimation in electric power grids. ACM Transactions on Information and System Security, 14(1), 2011. First presented at ACM CCS 2009.
- [54] Terminal-Bench Team. Harbor: A framework for running agent evaluations and RL environments. https://github.com/harbor-framework/harbor, 2026. Container-based agent-evaluation harness released with Terminal-Bench 2.0.
Appendix A. Task Catalog
The 41 families are listed below by domain area, each with its primary source or governing standard and the confide...
Short circuit and protection
Family | Source / standard | Confidence
three_phase_short_circuit | IEC 60909 | high
earth_fault_calculation | Anderson, Analysis of Faulted Power Systems | medium
breaker_relay_short_circuit | IEC 60909; Glover, Overbye & Sarma | high
distance_protection_settings | Horowitz & Phadke, Power System Relaying | me...
Stability, grid code, and inverter-based resources
Family | Source / standard | Confidence
critical_clearing_time | Equal-area criterion (Kundur) | high
transient_stability_prediction | Chen et al., transient-stability prediction | medium
frt_compliance | IEEE 2800 / ENTSO-E RfG (FRT) | medium
ibr_short_circuit_frt | IBR modeling for short circuit / FRT | medium
min_synchronous_...
Distributed resources, PV, EV, and storage
Family | Source / standard | Confidence
pv_volt_var | Turitsyn et al., local Volt-VAR control | medium
ev_v2g_outage_schedule | EVs for power quality & security | medium
ev_v2g_voltage_support | EVs for power quality & security | medium
bess_ancillary_response | Gonzalez-Longatt & Rueda Torres | medium
commercial_pv_lcoe_uncertainty | PV s...
Microgrids and dispatch
Family | Source / standard | Confidence
microgrid_economic_dispatch | España et al., microgrid dispatch | medium
rolling_microgrid_dispatch | España et al., microgrid dispatch | medium
islanded_microgrid_pq_dispatch | España et al., microgrid dispatch | medium
dispatch_uncertainty | Chung, advanced prediction for smart grids | low
hydro_thermal_storage_uc...
Reliability and restoration
Family | Source / standard | Confidence
flisr_restoration | Two-stage distribution service restoration | medium
fault_section_localization | Brown, faulted-circuit indicators | medium
fci_placement | Brown, faulted-circuit indicators | medium
fci_saidi_caidi | Brown, FCIs; IEEE 1366 indices | medium
operator_breaker_load_actions | Glover, Overbye & Sarm...
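For readers unfamiliar with the IEEE 1366 indices named in the reliability families, here is a small worked sketch of the standard definitions of SAIFI, SAIDI, and CAIDI. It is an illustration only: the function name `reliability_indices` and the event layout are hypothetical, major-event-day exclusion is ignored, and nothing here is taken from the benchmark's evaluators.

```python
def reliability_indices(interruptions, customers_served: int) -> dict:
    """Compute SAIFI, SAIDI, and CAIDI from sustained-interruption events.

    `interruptions` is a list of (customers_interrupted, duration_minutes) pairs.
    """
    customer_interruptions = sum(n for n, _ in interruptions)
    customer_minutes = sum(n * minutes for n, minutes in interruptions)

    saifi = customer_interruptions / customers_served   # interruptions per customer
    saidi = customer_minutes / customers_served         # minutes per customer
    # CAIDI is the average outage duration per interrupted customer (SAIDI / SAIFI).
    caidi = customer_minutes / customer_interruptions if customer_interruptions else 0.0
    return {"SAIFI": saifi, "SAIDI": saidi, "CAIDI": caidi}


# Two events on a 1000-customer feeder: 100 customers for 60 min, 50 for 120 min.
print(reliability_indices([(100, 60), (50, 120)], customers_served=1000))
```

For the two events shown, the customer-minutes total is 12000, which gives SAIDI = 12 minutes, SAIFI = 0.15, and CAIDI = 80 minutes.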
Power quality, standards, assets, and cybersecurity
Family | Source / standard | Confidence
en50160_voltage_compliance | EN 50160 | high
power_quality_event_classification | IEC 61000-4-30 | high
harmonic_ieee519_compliance | IEEE 519 | high
transformer_thermal_loading | IEC 60076-7 | high
fdi_state_estimation | Abur & Exposito, state estimation | high
protected_meter_placement | Graphi...
Forecasting under uncertainty
Family | Source / standard | Confidence
wind_power_forecast | Safari et al., short-term wind forecasting | medium
wind_prediction_interval | Khorramdel et al., fuzzy prediction intervals | medium
Appendix B. Experiment Artifacts
Run configuration. Each agent was invoked once per case through its own command-line interface under a 600-second...