Power Systems Agent Benchmark: Executable Evaluation of AI Agents in Electric Power Engineering

Sergei Trashchenkov

arXiv: 2606.20950 · v1 · pith:NSOWVBLZnew · submitted 2026-06-18 · 💻 cs.AI · cs.SY · eess.SY

Pith reviewed 2026-06-26 16:56 UTC · model grok-4.3

classification 💻 cs.AI · cs.SY · eess.SY

keywords power systems · AI agents · executable benchmark · deterministic evaluation · power engineering · task families · feasibility checking · constraint validation

The pith

The paper introduces an executable benchmark where AI agents receive structured power engineering tasks and return solutions that are checked by deterministic code for feasibility and violations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the Power Systems Agent Benchmark as a way to evaluate AI agents in electric power engineering through executable checks instead of text grading. An agent is given a structured task and must return a structured solution; a program then recomputes engineering quantities, verifies operational constraints, and outputs a feasibility flag, normalized score, and list of violations. The benchmark covers 41 task families across eight areas including power flow, protection, stability, microgrids, reliability, power quality, and forecasting, with each task drawn from citable sources or standards. Tasks are generated on demand from private seeds to prevent contamination while remaining inspectable. A reference evaluation with command-line agents shows performance differences and also serves as a check for defects in the tasks or evaluators themselves.

Core claim

The central claim is that an executable benchmark of 41 task families with deterministic evaluators can assess power-engineering agents by validating their structured outputs against engineering constraints and returning explicit feasibility flags, scores, and violations, while leaving room for future upgrades to simulator-backed checks without altering the task interface.

What carries the argument

The Power Systems Agent Benchmark, which pairs structured tasks with deterministic evaluators that recompute quantities and check constraints to produce feasibility flags, scores, and violations.
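The contract described above can be sketched in a few lines. This is a minimal illustration and not the benchmark's code: the `Evaluator` protocol, the `EvalResult` fields, the toy dispatch task, and the scoring rule (fraction of checks passed) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class EvalResult:
    feasible: bool                 # True only if no constraint is violated
    score: float                   # normalized to [0, 1]
    violations: list[str] = field(default_factory=list)


class Evaluator(Protocol):
    """Hypothetical task contract: any evaluator maps (task, solution) to a result."""

    def evaluate(self, task: dict, solution: dict) -> EvalResult: ...


class DispatchEvaluator:
    """Hypothetical compact evaluator for a toy dispatch task."""

    def evaluate(self, task: dict, solution: dict) -> EvalResult:
        dispatch = solution.get("dispatch_mw", {})
        violations = []
        checks = 0
        # Check each generator setpoint against its limits.
        for name, limits in task["generators"].items():
            checks += 1
            p = dispatch.get(name)
            if p is None:
                violations.append(f"{name}: no setpoint")
            elif not (limits["p_min"] <= p <= limits["p_max"]):
                violations.append(
                    f"{name}: {p} MW outside [{limits['p_min']}, {limits['p_max']}]"
                )
        # Recompute the power balance instead of trusting any reported total.
        checks += 1
        mismatch = sum(dispatch.values()) - task["load_mw"]
        if abs(mismatch) > task["tolerance_mw"]:
            violations.append(f"balance: mismatch {mismatch:+.1f} MW")
        score = (checks - len(violations)) / checks
        return EvalResult(not violations, score, violations)


task = {
    "load_mw": 150.0,
    "tolerance_mw": 0.5,
    "generators": {
        "g1": {"p_min": 10.0, "p_max": 100.0},
        "g2": {"p_min": 20.0, "p_max": 80.0},
    },
}
bad = DispatchEvaluator().evaluate(task, {"dispatch_mw": {"g1": 90.0, "g2": 65.0}})
good = DispatchEvaluator().evaluate(task, {"dispatch_mw": {"g1": 90.0, "g2": 60.0}})
print(bad)
print(good)
```

Because callers depend only on the `evaluate` signature, a simulator-backed class could in principle replace `DispatchEvaluator` without changing how the task or the solution is expressed, which is the property the paper attributes to its task contract.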

If this is right

Agents receive concrete scores based on whether their solutions satisfy engineering constraints rather than on the quality of their explanations.
The same task format can later support evaluator upgrades to full simulators without changing how agents are instructed or how solutions are submitted.
Unanimous failures across multiple agents can flag defects in individual tasks or evaluators for correction.
Held-out instances generated from private seeds allow measurement of generalization separate from public-split performance.
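The private-seed idea in the last point can be illustrated as follows. The paper does not say how seeds are turned into instances, so this sketch assumes a keyed hash (HMAC-SHA256) over the family name and case index; `case_rng`, `make_case`, the family name `toy_dispatch`, and the parameter ranges are hypothetical.

```python
import hashlib
import hmac
import random


def case_rng(private_seed: bytes, family: str, index: int) -> random.Random:
    """Derive a per-case random generator from a secret seed (assumed scheme)."""
    msg = f"{family}:{index}".encode()
    digest = hmac.new(private_seed, msg, hashlib.sha256).digest()
    return random.Random(int.from_bytes(digest, "big"))


def make_case(private_seed: bytes, family: str, index: int) -> dict:
    """Generate one toy task instance; the generator code can be public."""
    rng = case_rng(private_seed, family, index)
    return {
        "family": family,
        "index": index,
        "load_mw": round(rng.uniform(50.0, 200.0), 1),
        "line_limit_mva": rng.choice([100, 150, 200]),
    }


first = make_case(b"held-out-secret", "toy_dispatch", 0)
again = make_case(b"held-out-secret", "toy_dispatch", 0)
print(first)
print(first == again)
```

The generator is fully inspectable, yet the concrete instances cannot be reproduced without the seed; the same seed always yields the same case, so held-out runs stay repeatable for whoever holds it.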

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be adapted to create similar executable benchmarks in other engineering domains that rely on quantitative checks.
If the benchmark correlates with real-world performance, it could guide development of agents that integrate with domain-specific engineering software.
Consistency between public-split and held-out results suggests the generation method resists contamination while remaining reproducible.

Load-bearing premise

The 41 task families and their deterministic evaluators are representative enough of real power engineering problems to act as valid proxies for feasibility.

What would settle it

An experiment in which agents that score highly on the benchmark perform poorly when applied to actual power system operations or when the same tasks are solved using full power-system simulators.

Figures

Figures reproduced from arXiv: 2606.20950 by Sergei Trashchenkov.

Original abstract

Executable evaluation -- checking the consequences of an agent's actions with a program rather than grading its prose -- has become a prominent way to assess tool-using AI agents in software settings. Electric power engineering has not yet had an analogous benchmark: language-model use is still dominated by retrieval and text question answering, while agents acting on power-system artifacts remain mostly academic prototypes. We introduce the Power Systems Agent Benchmark, an executable benchmark for power-engineering agents. An agent receives a structured task and returns a structured solution; a deterministic evaluator recomputes the engineering quantities, checks operational constraints, and returns a feasibility flag, a normalized score, and explicit violations. The benchmark contains 41 task families across eight areas of power engineering, from power flow and protection to stability, microgrids, reliability, power quality, and forecasting. Each task is grounded in a citable source, standard, or documented engineering formulation. To resist contamination, held-out cases are synthesized on demand by per-family generators from private seeds: the construction is inspectable, but the instances remain private. In a reference evaluation with three command-line agents, the strongest score near the compact tier's ceiling, a smaller open model trails, and public and held-out performance are broadly consistent; a separate public-split grid with OpenCode and Aider probes harness effects. The reference evaluation doubles as quality control: unanimous failures flag candidate task or evaluator defects, and it exposed a latent evaluator bug missed by self-consistency checks. The evaluators are compact deterministic surrogates, but the task contract allows their internals to be upgraded to simulator-backed checks without changing how tasks are posed or solved.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper introduces a new executable benchmark for power systems agents with 41 task families and deterministic evaluators, but the surrogates lack any validation against full simulators.

The main thing here is a benchmark for testing AI agents on power engineering problems where the solutions get run through code to check feasibility, scores, and violations instead of just grading text. It covers 41 task families across eight areas like power flow, protection, stability, and forecasting, each tied to citable standards or formulations, with tasks generated on demand from private seeds to keep them held-out.

What works is the practical setup. The reference evaluation with three command-line agents produces baseline scores and shows consistency between public and held-out splits. The quality control step, where unanimous agent failures flag defects, actually caught a latent evaluator bug. The design also leaves room to upgrade the compact deterministic evaluators to simulator-backed checks later without altering how tasks are posed.
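The quality-control step can be sketched as a simple filter over per-agent outcomes. The data layout (agent name to case id to feasibility flag), the function name, and the agent labels are assumptions for illustration; the paper's actual tooling is not described here.

```python
def unanimous_failures(results: dict[str, dict[str, bool]]) -> list[str]:
    """Return case ids that every agent attempted and none solved feasibly."""
    if not results:
        return []
    # Only consider cases that all agents were run on.
    shared = set.intersection(*(set(cases) for cases in results.values()))
    return sorted(
        case for case in shared
        if not any(agent_cases[case] for agent_cases in results.values())
    )


results = {
    "agent_a": {"case1": True, "case2": False, "case3": False},
    "agent_b": {"case1": False, "case2": False, "case3": True},
    "agent_c": {"case1": True, "case2": False, "case3": True},
}
print(unanimous_failures(results))  # ['case2']
```

A flagged case is only a candidate: it may be a genuinely hard task rather than a defect, so each one still needs manual inspection of the task and its evaluator.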

The soft spot is exactly the one the stress-test note flags. The evaluators are described as surrogates, but there is no reported comparison or error-bound analysis against established tools like pandapower or PSSE. Unanimous failures only catch gross bugs, not systematic approximation errors in the engineering quantities or constraints. This leaves the proxy validity untested, which directly affects how much the feasibility flags and scores can be trusted as real engineering assessments.

This is for researchers working on AI agents for engineering or infrastructure domains who want a structured way to measure performance beyond retrieval or QA. A reader building similar executable benchmarks would find the task contract and contamination resistance useful.

It deserves peer review. The artifact is new, the methodology is concrete, and the gap in evaluator validation can be closed with additional experiments; it is not a load-bearing flaw.

Referee Report

1 major / 0 minor

Summary. The paper introduces the Power Systems Agent Benchmark, an executable benchmark for AI agents in electric power engineering. Agents receive structured tasks across 41 families in eight areas (power flow, protection, stability, etc.) and return structured solutions; deterministic evaluators recompute quantities, check constraints, and output feasibility flags, normalized scores, and violations. Tasks are grounded in citable sources/standards, with on-demand synthesis from private seeds for held-out instances to resist contamination. A reference evaluation with three command-line agents is reported, along with quality control via unanimous agent failures that exposed an evaluator bug; evaluators are described as compact deterministic surrogates that can later be upgraded to simulator-backed checks without altering the task interface.

Significance. If the benchmark's evaluators prove reliable, this work supplies a much-needed executable evaluation framework for a domain where AI use remains largely limited to retrieval and text QA. Credit is due for the contamination-resistant design (on-demand synthesis from private seeds), the explicit grounding in external standards, the quality-control mechanism that detected a latent bug, and the forward-compatible task contract that permits evaluator upgrades. These features position the artifact as a reusable, inspectable starting point rather than a one-off leaderboard.

major comments (1)

[Abstract and Reference Evaluation description] The central claim that the benchmark supplies valid executable evaluation rests on the 41 task families' deterministic evaluators correctly recomputing quantities and enforcing constraints. However, the manuscript reports no direct comparison, error-bound analysis, or validation of these compact surrogates against established full simulators (e.g., pandapower or PSSE). Quality control via unanimous failures only flags gross bugs, not systematic approximation errors in the engineering formulations. This leaves the proxy-validity assumption untested and directly affects whether benchmark outputs constitute reliable engineering assessment.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the benchmark's design features, including contamination resistance, grounding in standards, and the quality-control mechanism. We address the single major comment below.

Referee: [Abstract and Reference Evaluation description] The central claim that the benchmark supplies valid executable evaluation rests on the 41 task families' deterministic evaluators correctly recomputing quantities and enforcing constraints. However, the manuscript reports no direct comparison, error-bound analysis, or validation of these compact surrogates against established full simulators (e.g., pandapower or PSSE). Quality control via unanimous failures only flags gross bugs, not systematic approximation errors in the engineering formulations. This leaves the proxy-validity assumption untested and directly affects whether benchmark outputs constitute reliable engineering assessment.

Authors: We agree that the manuscript provides no direct numerical comparison or error-bound analysis of the compact evaluators against full simulators such as pandapower or PSSE. Each evaluator implements a deterministic version of the engineering calculation drawn from the citable source or standard listed for its task family; the formulations are therefore transparent and inspectable rather than black-box approximations. Nevertheless, the absence of an empirical validation study against established simulators leaves the magnitude of any systematic discrepancy unquantified, which is a genuine limitation for claims of proxy validity. In the revised manuscript we will add an explicit Limitations subsection that states this gap, reiterates the forward-compatible task contract that permits future replacement by simulator-backed evaluators, and outlines a concrete plan for such validation on a representative subset of task families.

Revision: yes
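The kind of error-bound analysis under discussion can be illustrated with a toy comparison. The sketch below is not taken from the paper and uses no simulator: it compares the textbook DC power-flow approximation for a lossless two-bus line, P ≈ δ/X with both voltages at 1 pu, against the exact expression P = sin(δ)/X, and reports the worst relative error over a range of angles. The function names, the reactance value, and the angle range are assumptions.

```python
import math


def p_exact(delta_rad: float, x_pu: float) -> float:
    """Active power over a lossless line with 1 pu voltages at both ends."""
    return math.sin(delta_rad) / x_pu


def p_surrogate(delta_rad: float, x_pu: float) -> float:
    """DC power-flow approximation: sin(delta) replaced by delta."""
    return delta_rad / x_pu


def max_relative_error(angles_deg, x_pu: float = 0.1) -> float:
    """Worst-case relative error of the surrogate over the given angles."""
    worst = 0.0
    for deg in angles_deg:
        delta = math.radians(deg)
        ref = p_exact(delta, x_pu)
        worst = max(worst, abs(p_surrogate(delta, x_pu) - ref) / abs(ref))
    return worst


# Angles start at 1 degree to avoid dividing by a zero reference value.
err = max_relative_error(range(1, 31))
print(f"max relative error up to 30 degrees: {err:.2%}")  # 4.72%
```

A validation study of the benchmark's evaluators would follow the same pattern with a full simulator as the reference, reporting per-family error bounds and whether any feasibility verdicts flip near constraint limits.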

Circularity Check

0 steps flagged

No circularity: benchmark is a new artifact grounded in external sources

The paper introduces the Power Systems Agent Benchmark as a new executable evaluation artifact. Tasks are explicitly grounded in citable external standards or documented engineering formulations, with held-out cases generated on demand from private seeds. Deterministic evaluators are described as compact surrogates that can be upgraded without changing task contracts. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The central construction does not reduce to its own inputs by definition or by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on standard engineering formulations from external sources and the assumption that deterministic evaluators can serve as valid surrogates; no free parameters, ad-hoc axioms, or invented entities are introduced.

axioms (1)

Domain assumption: each task is grounded in a citable source, standard, or documented engineering formulation.
Stated directly in the abstract as the grounding for all 41 task families.

Appendix A. Task Catalog

The 41 families are listed below by domain area, each with its primary source or governing standard and the confide...

Short circuit and protection

Family | Source / standard | Confidence
three_phase_short_circuit | IEC 60909 | high
earth_fault_calculation | Anderson, Analysis of Faulted Power Systems | medium
breaker_relay_short_circuit | IEC 60909; Glover, Overbye & Sarma | high
distance_protection_settings | Horowitz & Phadke, Power System Relaying | me...

Stability, grid code, and inverter-based resources

Family | Source / standard | Confidence
critical_clearing_time | Equal-area criterion (Kundur) | high
transient_stability_prediction | Chen et al., transient-stability prediction | medium
frt_compliance | IEEE 2800 / ENTSO-E RfG (FRT) | medium
ibr_short_circuit_frt | IBR modeling for short circuit / FRT | medium
min_synchronous_...

Distributed resources, PV, EV, and storage

Family | Source / standard | Confidence
pv_volt_var | Turitsyn et al., local Volt-VAR control | medium
ev_v2g_outage_schedule | EVs for power quality & security | medium
ev_v2g_voltage_support | EVs for power quality & security | medium
bess_ancillary_response | Gonzalez-Longatt & Rueda Torres | medium
commercial_pv_lcoe_uncertainty | PV s...

Microgrids and dispatch

Family | Source / standard | Confidence
microgrid_economic_dispatch | España et al., microgrid dispatch | medium
rolling_microgrid_dispatch | España et al., microgrid dispatch | medium
islanded_microgrid_pq_dispatch | España et al., microgrid dispatch | medium
dispatch_uncertainty | Chung, advanced prediction for smart grids | low
hydro_thermal_storage_uc...

Reliability and restoration

Family | Source / standard | Confidence
flisr_restoration | Two-stage distribution service restoration | medium
fault_section_localization | Brown, faulted-circuit indicators | medium
fci_placement | Brown, faulted-circuit indicators | medium
fci_saidi_caidi | Brown, FCIs; IEEE 1366 indices | medium
operator_breaker_load_actions | Glover, Overbye & Sarm...

Power quality, standards, assets, and cybersecurity

Family | Source / standard | Confidence
en50160_voltage_compliance | EN 50160 | high
power_quality_event_classification | IEC 61000-4-30 | high
harmonic_ieee519_compliance | IEEE 519 | high
transformer_thermal_loading | IEC 60076-7 | high
fdi_state_estimation | Abur & Exposito, state estimation | high
protected_meter_placement | Graphi...

Forecasting under uncertainty

Family | Source / standard | Confidence
wind_power_forecast | Safari et al., short-term wind forecasting | medium
wind_prediction_interval | Khorramdel et al., fuzzy prediction intervals | medium

Appendix B. Experiment Artifacts

Run configuration. Each agent was invoked once per case through its own command-line interface under a 600-second...

[1] [1]

Large language models for power system applications: A comprehensive literature survey.arXiv preprint arXiv:2512.13004, 2025

Muhammad Sarwar, Muhammad Rizwan, Mubushra Aziz, and Abdul Rehman Sudais. Large language models for power system applications: A comprehensive literature survey.arXiv preprint arXiv:2512.13004, 2025

arXiv 2025

[2] [2]

Agentic AI systems in electrical power systems engineering: Current state-of-the-art and challenges.arXiv preprint arXiv:2511.14478, 2025

Soham Ghosh and Gaurav Mittal. Agentic AI systems in electrical power systems engineering: Current state-of-the-art and challenges.arXiv preprint arXiv:2511.14478, 2025

arXiv 2025

[3] [3]

Gridmind: LLMs-powered agents for power system analysis and operations.arXiv preprint arXiv:2509.02494, 2025

Hongwei Jin, Kibaek Kim, and Jonghwan Kwon. Gridmind: LLMs-powered agents for power system analysis and operations.arXiv preprint arXiv:2509.02494, 2025

arXiv 2025

[4] [4]

X-gridagent: An LLM-powered agentic AI system for assisting power grid analysis.arXiv preprint arXiv:2512.20789, 2025

Yihan Wen and Xin Chen. X-gridagent: An LLM-powered agentic AI system for assisting power grid analysis.arXiv preprint arXiv:2512.20789, 2025

arXiv 2025

[5] [5]

Judging LLM-as-a-judge with MT-bench and chatbot arena

Lianmin Zheng et al. Judging LLM-as-a-judge with MT-bench and chatbot arena. InNeurIPS Datasets and Benchmarks, 2023. arXiv:2306.05685

Pith/arXiv arXiv 2023

[6] [6]

Largelanguagemodelsarenotfairevaluators.arXiv preprint arXiv:2305.17926, 2023

PeiyiWangetal. Largelanguagemodelsarenotfairevaluators.arXiv preprint arXiv:2305.17926, 2023

Pith/arXiv arXiv 2023

[7] [7]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world github issues? In ICLR, 2024. arXiv:2310.06770

Pith/arXiv arXiv 2024

[8] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024

[9] Laude Institute and Stanford University. Terminal-bench: Benchmarking ai agents on realistic terminal tasks. https://www.tbench.ai/, 2025

[10] Mark Chen et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

[11] Jun Shern Chan et al. MLE-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024

[12] Xiao Liu et al. Agentbench: Evaluating LLMs as agents. In ICLR, 2024. arXiv:2308.03688

[13] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. arXiv preprint arXiv:2311.12983, 2023

[14] Shuyan Zhou et al. Webarena: A realistic web environment for building autonomous agents. In ICLR, 2024. arXiv:2307.13854

[15] Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, and Jürgen Schmidhuber. Agent-as-a-judge: Evaluate agents with agents. arXiv preprint arXiv:2410.10934, 2024

[16] Fangyi Yu. When AIs judge AIs: The rise of agent-as-a-judge evaluation for LLMs. arXiv preprint arXiv:2508.02994, 2025

[17] PowerAgent community, Harvard SEAS. PowerAgentBench: Standardized tasks, environments, and metrics for power-system agents. GitHub repository, Power-Agent/PowerAgentBench, 2026. URL https://github.com/Power-Agent/PowerAgentBench. Benchmark component of the PowerAgent ecosystem (poweragent.seas.harvard.edu); no dedicated publication at the time of writing

[18] Qian Zhang and Le Xie. Poweragent: A road map toward agentic intelligence in power systems: Foundation model, model context protocol, and workflow. IEEE Power & Energy Magazine, 23(5):93–101, 2025

[19] Xiyuan Zhou et al. Elecbench: a power dispatch evaluation benchmark for large language models. In IEEE PES General Meeting, 2025. arXiv:2407.05365; Best Paper

[20] PFBench: Power-flow benchmark for LLM-based power system agent evaluation. IEEE DataPort, 2025. URL https://ieee-dataport.org/documents/power-flow-benchmark-llm-based-power-system-agent-evaluation-pfbench

[21] Mohamed Shamseldein. Grid-mind: An LLM-orchestrated multi-fidelity agent for automated connection impact assessment. arXiv preprint arXiv:2602.20683, 2026

[22] Buxin She, Brian Chen, Luanzheng Guo, and Fangxing Li. PFAgent: A tractable and self-evolving power-flow agent for interactive grid analysis. arXiv preprint arXiv:2604.10846, 2026

[23] Antoine Marot, Benjamin Donnot, Camilo Romero, Balthazar Donon, Marvin Lerousseau, Luca Veyrin-Forrer, and Isabelle Guyon. Learning to run a power network challenge for training topology controllers. Electric Power Systems Research, 189, 2020. arXiv:1912.04211

[24] Antoine Marot, Benjamin Donnot, Gabriel Dulac-Arnold, Adrian Kelly, Aïdan O’Sullivan, Jan Viebahn, Mariette Awad, Isabelle Guyon, Patrick Panciatici, and Camilo Romero. Learning to run a power network challenge: a retrospective analysis. In NeurIPS 2020 Competition and Demonstration Track, PMLR v133, pages 112–132, 2021

[25] Leon Thurner, Alexander Scheidler, Florian Schäfer, Jan-Hendrik Menke, Julian Dollichon, Friederike Meier, Steffen Meinecke, and Martin Braun. pandapower — an open-source python tool for convenient modeling, analysis, and optimization of electric power systems. IEEE Transactions on Power Systems, 33(6):6510–6521, 2018

[26] Ray D. Zimmerman, Carlos E. Murillo-Sánchez, and Robert J. Thomas. MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education. IEEE Transactions on Power Systems, 26(1):12–19, 2011

[27] Roger C. Dugan and Thomas E. McDermott. An open source platform for collaborating on smart grid research. In IEEE PES General Meeting, 2011

[28] Carleton Coffrin, Russell Bent, Kaarthik Sundar, Yeesian Ng, and Miles Lubin. Powermodels.jl: An open-source framework for exploring power flow formulations. In Power Systems Computation Conference (PSCC), 2018

[29] Sogol Babaeinejadsarookolaee et al. The power grid library for benchmarking AC optimal power flow algorithms. arXiv preprint arXiv:1908.02788, 2019

[30] Hantao Cui, Fangxing Li, and Kevin Tomsovic. Hybrid symbolic-numeric framework for power system modeling and analysis. IEEE Transactions on Power Systems, 36(2):1373–1384, 2021

[31] Tom Brown, Jonas Hörsch, and David Schlachtberger. PyPSA: Python for power system analysis. Journal of Open Research Software, 6(1), 2018

[32] Yuxing Cheng, Yi Chang, and Yuan Wu. A survey on data contamination for large language models. arXiv preprint arXiv:2502.14425, 2025

[33] Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, and Baishakhi Ray. Recent advances in large language model benchmarks against data contamination: From static to dynamic evaluation. arXiv preprint arXiv:2502.17521, 2025

[34] Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. AI agents that matter. arXiv preprint arXiv:2407.01502, 2024

[35] Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024

[36] Yiting Zhao, Chen Yuan, Sun Li, Guangyi Liu, Renchang Dai, and Zhiwei Wang. Graph computing based fast screening in contingency analysis. arXiv preprint arXiv:1904.00044, 2019

[37] IEEE Std 738-2012 standard for calculating the current-temperature relationship of bare overhead conductors, 2012

[38] IEC 60909-0:2016 short-circuit currents in three-phase a.c. systems — part 0: Calculation of currents, 2016

[39] Paul M. Anderson. Analysis of Faulted Power Systems. Wiley-IEEE Press, 1995. ISBN 978-0-7803-1145-9

[40] Stanley H. Horowitz and Arun G. Phadke. Power System Relaying. Wiley, 4th edition, 2014. ISBN 978-1-118-66200-7

[41] IEC 60364-5-52 low-voltage electrical installations — selection and erection of electrical equipment — wiring systems, 2009

[42] Prabha Kundur. Power System Stability and Control. McGraw-Hill, 1994

[43] IEEE Std 2800-2022 standard for interconnection and interoperability of inverter-based resources interconnecting with associated transmission electric power systems, 2022

[44] ENTSO-E network code on requirements for grid connection of generators (RfG), 2016

[45] Konstantin Turitsyn, Petr Šulc, Scott Backhaus, and Michael Chertkov. Local control of reactive power by distributed photovoltaic generators. arXiv preprint arXiv:1006.0160, 2010

[47] Shiva Poudel and Anamika Dubey. A two-stage service restoration method for electric power distribution systems. arXiv preprint arXiv:2004.07921, 2020

[48] IEEE Std 1366-2012 guide for electric power distribution reliability indices, 2012

[49] EN 50160:2010 voltage characteristics of electricity supplied by public distribution networks, 2010

[50] IEEE Std 519-2014 recommended practice and requirements for harmonic control in electric power systems, 2014

[51] IEC 60076-7:2018 power transformers — part 7: Loading guide for mineral-oil-immersed power transformers, 2018

[52] Ali Abur and Antonio Gómez Expósito. Power System State Estimation: Theory and Implementation. Marcel Dekker, 2004

[53] Yao Liu, Peng Ning, and Michael K. Reiter. False data injection attacks against state estimation in electric power grids. ACM Transactions on Information and System Security, 14(1), 2011. First presented at ACM CCS 2009

[54] Terminal-Bench Team. Harbor: A framework for running agent evaluations and rl environments. https://github.com/harbor-framework/harbor, 2026. Container-based agent-evaluation harness released with Terminal-Bench 2.0

Appendix A. Task Catalog

The 41 families are listed below by domain area, each with its primary source or governing standard and the confide...

Short circuit and protection

Family | Source / standard | Confidence
three_phase_short_circuit | IEC 60909 | high
earth_fault_calculation | Anderson, Analysis of Faulted Power Systems | medium
breaker_relay_short_circuit | IEC 60909; Glover, Overbye & Sarma | high
distance_protection_settings | Horowitz & Phadke, Power System Relaying | me...

Stability, grid code, and inverter-based resources

Family | Source / standard | Confidence
critical_clearing_time | Equal-area criterion (Kundur) | high
transient_stability_prediction | Chen et al., transient-stability prediction | medium
frt_compliance | IEEE 2800 / ENTSO-E RfG (FRT) | medium
ibr_short_circuit_frt | IBR modeling for short circuit / FRT | medium
min_synchronous_...

Distributed resources, PV, EV, and storage

Family | Source / standard | Confidence
pv_volt_var | Turitsyn et al., local Volt-VAR control | medium
ev_v2g_outage_schedule | EVs for power quality & security | medium
ev_v2g_voltage_support | EVs for power quality & security | medium
bess_ancillary_response | Gonzalez-Longatt & Rueda Torres | medium
commercial_pv_lcoe_uncertainty | PV s...

Microgrids and dispatch

Family | Source / standard | Confidence
microgrid_economic_dispatch | España et al., microgrid dispatch | medium
rolling_microgrid_dispatch | España et al., microgrid dispatch | medium
islanded_microgrid_pq_dispatch | España et al., microgrid dispatch | medium
dispatch_uncertainty | Chung, advanced prediction for smart grids | low
hydro_thermal_storage_uc...

Reliability and restoration

Family | Source / standard | Confidence
flisr_restoration | Two-stage distribution service restoration | medium
fault_section_localization | Brown, faulted-circuit indicators | medium
fci_placement | Brown, faulted-circuit indicators | medium
fci_saidi_caidi | Brown, FCIs; IEEE 1366 indices | medium
operator_breaker_load_actions | Glover, Overbye & Sarm...

Power quality, standards, assets, and cybersecurity

Family | Source / standard | Confidence
en50160_voltage_compliance | EN 50160 | high
power_quality_event_classification | IEC 61000-4-30 | high
harmonic_ieee519_compliance | IEEE 519 | high
transformer_thermal_loading | IEC 60076-7 | high
fdi_state_estimation | Abur & Exposito, state estimation | high
protected_meter_placement | Graphi...

Forecasting under uncertainty

Family | Source / standard | Confidence
wind_power_forecast | Safari et al., short-term wind forecasting | medium
wind_prediction_interval | Khorramdel et al., fuzzy prediction intervals | medium

Appendix B. Experiment Artifacts

Run configuration. Each agent was invoked once per case through its own command-line interface under a 600-second...