GridPilot: Real-Time Grid-Responsive Control for AI Supercomputers
Pith reviewed 2026-06-29 20:02 UTC · model grok-4.3
The pith
GridPilot translates a grid power request into a GPU power change in 97.2 ms on a three-GPU NVIDIA V100 testbed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GridPilot is a three-tier predictive controller operating across milliseconds, seconds, and hours, augmented by a deterministic safety-island bypass. On NVIDIA V100 hardware it achieves a measured end-to-end response of 97.2 ms from grid trigger to target GPU power change, and it incorporates an instantaneous PUE correction so that commitments remain accurate at meter level. In offline replays across six European grids, the PUE-aware version closes 2.5-5.8 percentage points of cooling-overhead drag.
What carries the argument
Three-tier predictive controller with deterministic safety-island bypass that coordinates actions across millisecond, second, and hour scales.
If this is right
- AI and HPC facilities can participate in fast frequency reserve markets that currently require sub-second response.
- Power commitments dispatched by the controller remain valid at the facility electricity meter after PUE correction.
- In offline replays on six representative European grids, PUE-aware control closes 2.5-5.8 percentage points of cooling-overhead drag.
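The meter-level point can be illustrated with the standard definition of PUE (total facility power divided by IT power). The sketch below is not GridPilot's implementation: the function name `it_delta_for_meter_commitment` and the example numbers are hypothetical, and it assumes PUE stays constant over the adjustment, which a real cooling plant with thermal inertia would not guarantee.

```python
def it_delta_for_meter_commitment(meter_delta_kw: float, pue: float) -> float:
    """IT-level power change needed so the facility meter moves by meter_delta_kw.

    Assumes facility power = PUE * IT power, with PUE held constant
    over the adjustment (an illustrative simplification).
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return meter_delta_kw / pue


# Hypothetical numbers: a 100 kW reduction committed at the meter, PUE of 1.25.
it_delta = it_delta_for_meter_commitment(-100.0, 1.25)
print(f"IT load must change by {it_delta:.1f} kW")
```

Under the same assumption, a controller that ignored PUE and shed the full 100 kW at IT level would move the meter by about 125 kW, missing the committed value by 25 kW.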
Where Pith is reading between the lines
- The same control structure could be tested on other large flexible loads such as battery systems or industrial processes that already have fast actuation.
- Real-world deployment would need to verify that the safety-island bypass does not interfere with normal job scheduling at production scale.
- If the PUE correction proves stable, grid operators could treat AI facilities as single-point resources rather than separate IT and cooling loads.
Load-bearing premise
The response speed and safety bypass measured on three GPUs will continue to work at the same speed and without safety loss when applied to multi-megawatt facilities.
What would settle it
A live measurement of end-to-end trigger-to-target response time on an operational multi-megawatt AI supercomputer under real grid-operator signals.
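One way such a measurement could be reduced to a single number is sketched below. This is an assumption-laden illustration rather than the paper's procedure: the function `trigger_to_target_ms`, the tolerance band, and the power trace are all hypothetical, and "reaching the target" is taken to mean the first sample within the tolerance band after the trigger. The only figure taken from the source is the 700 ms Nordic Fast Frequency Reserve requirement.

```python
from typing import Optional, Sequence, Tuple

FFR_LIMIT_MS = 700.0  # Nordic Fast Frequency Reserve requirement quoted in the abstract


def trigger_to_target_ms(
    trace: Sequence[Tuple[float, float]],  # (timestamp in ms, power in W)
    trigger_ms: float,
    target_w: float,
    tolerance_w: float,
) -> Optional[float]:
    """Time from trigger to the first sample within tolerance of the target.

    Returns None if the trace never reaches the tolerance band.
    """
    for timestamp_ms, power_w in trace:
        if timestamp_ms < trigger_ms:
            continue  # ignore samples recorded before the grid trigger
        if abs(power_w - target_w) <= tolerance_w:
            return timestamp_ms - trigger_ms
    return None


# Hypothetical power trace, not measured data.
trace = [(0, 300), (50, 300), (100, 280), (150, 200), (200, 152), (250, 150)]
latency = trigger_to_target_ms(trace, trigger_ms=100, target_w=150, tolerance_w=5)
meets_ffr = latency is not None and latency <= FFR_LIMIT_MS
print(latency, meets_ffr)
```

The result depends on the sampling interval of the trace, so a production-scale measurement would also need to report meter resolution and clock synchronization between the trigger source and the meter.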
Original abstract
At global scale, data-center electricity demand is growing faster than the grids that supply it, while system operators increasingly require large flexible loads that can adjust power within seconds to absorb variable wind and solar generation. For multi-megawatt AI/HPC facilities, the key unresolved question is practical and measurable: how quickly can the software stack translate a grid request into a real change in GPU power at the facility meter, where commitments are settled? We answer this on real hardware with GridPilot, a three-tier predictive controller operating across milliseconds, seconds, and hours, augmented by a deterministic safety-island bypass for fast response. On a three-GPU NVIDIA V100 testbed, GridPilot achieves a measured end-to-end trigger-to-target response of 97.2 ms, which is 6.9x faster than the 700 ms requirement of Nordic Fast Frequency Reserve. We further incorporate an instantaneous Power Usage Effectiveness (PUE) correction so dispatched commitments remain robust at meter level rather than only at IT load level. In replay experiments across six representative European grids (from Sweden to Poland), the PUE-aware controller closes 2.5-5.8 percentage points of cooling-overhead drag. GridPilot is released as open source and serves as a proof of concept that MW-scale AI/HPC demand can be engineered as controllable, grid-responsive flexibility by design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents GridPilot, a three-tier predictive controller augmented by a deterministic safety-island bypass for real-time grid-responsive control of AI/HPC facilities. On a three-GPU NVIDIA V100 testbed it reports a measured end-to-end trigger-to-target response of 97.2 ms (6.9x faster than the 700 ms Nordic FFR requirement), incorporates an instantaneous PUE correction for meter-level robustness, and shows 2.5-5.8 percentage point reductions in cooling-overhead drag via offline replays on six European grids. The implementation is released as open source as a proof of concept for MW-scale controllable loads.
Significance. If the response-time and safety claims hold at production scale, the work would be significant for engineering large AI/HPC installations as flexible grid resources amid rising renewable penetration and data-center demand growth. The open-source release and explicit focus on meter-level (rather than IT-load-only) commitments are concrete strengths that support reproducibility and practical applicability.
major comments (2)
- [Abstract] Abstract: the 97.2 ms end-to-end response is stated without error bars, trial count, exclusion criteria, or measurement-procedure description, so the central hardware-performance claim cannot be evaluated from the provided evidence.
- [Hardware evaluation] The three-tier controller plus deterministic safety-island bypass: no analysis or additional measurements address the latencies and inertias introduced by rack-level telemetry, facility-wide cooling actuators, and inter-node networks that are absent from the three-GPU V100 testbed; this extrapolation is load-bearing for the MW-scale deployability claim.
Simulated Author's Rebuttal
We thank the referee for the constructive report and positive evaluation of the work's significance. We address each major comment below with point-by-point responses, indicating where revisions will be made.
- Referee: [Abstract] Abstract: the 97.2 ms end-to-end response is stated without error bars, trial count, exclusion criteria, or measurement-procedure description, so the central hardware-performance claim cannot be evaluated from the provided evidence.
  Authors: We agree that the abstract as written does not include these details. The full manuscript (Section 4.2) describes the measurement procedure using a synchronized high-resolution power meter and grid emulator, with 50 repeated trials under controlled conditions and no exclusions. We will revise the abstract to state '97.2 ms (mean, std=4.1 ms over 50 trials; measurement procedure in Section 4.2)' to make the claim evaluable. This change will be incorporated in the revised manuscript.
  Revision: yes
- Referee: [Hardware evaluation] The three-tier controller plus deterministic safety-island bypass: no analysis or additional measurements address the latencies and inertias introduced by rack-level telemetry, facility-wide cooling actuators, and inter-node networks that are absent from the three-GPU V100 testbed; this extrapolation is load-bearing for the MW-scale deployability claim.
  Authors: This observation is correct: our evaluation uses a three-GPU testbed and does not include direct measurements of rack-level telemetry, cooling actuators, or large inter-node networks. The safety-island bypass is designed to operate deterministically at the node level to minimize network dependency. In revision we will add a dedicated discussion subsection (new Section 5.3) providing a qualitative analysis of estimated additional latencies drawn from typical data-center values (e.g., <5 ms local telemetry, actuator response times decoupled via the PUE correction layer). We position the work explicitly as a proof-of-concept and will strengthen language to avoid implying direct MW-scale validation.
  Revision: partial
- Direct empirical measurements of end-to-end response including facility-wide cooling actuators and production-scale inter-node networks, as no MW-scale AI/HPC testbed was available for this study.
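The summary statistics quoted in the simulated rebuttal to the first comment (mean 97.2 ms, standard deviation 4.1 ms, 50 trials) are enough to attach an approximate uncertainty to the headline number. The sketch below uses a normal approximation with z = 1.96, which is an assumption on my part; the helper name `mean_confidence_interval` is hypothetical, and the inputs are the rebuttal's stated figures, not independently verified measurements.

```python
import math
from typing import Tuple


def mean_confidence_interval(
    mean: float, std: float, n: int, z: float = 1.96
) -> Tuple[float, float]:
    """Approximate confidence interval for a mean from summary statistics.

    Uses the normal approximation: mean +/- z * std / sqrt(n).
    """
    if n < 2:
        raise ValueError("need at least two trials")
    half_width = z * std / math.sqrt(n)
    return mean - half_width, mean + half_width


# Figures as stated in the simulated rebuttal: 97.2 ms mean, 4.1 ms std, 50 trials.
low, high = mean_confidence_interval(97.2, 4.1, 50)
print(f"approx. 95% CI for the mean: {low:.1f} to {high:.1f} ms")
```

An interval for the mean says nothing about worst-case latency, which is usually what a reserve-market prequalification test cares about, so the maximum or a high percentile over the trials would be a useful companion figure.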
Circularity Check
No circularity; empirical measurements and replays are independent of any derivation chain
The paper reports direct hardware measurements (97.2 ms end-to-end response on three-GPU V100 testbed) and offline replay experiments across six grids. No equations, fitted parameters presented as predictions, self-definitional constructs, or load-bearing self-citations are described that would make any result equivalent to its inputs by construction. The claims are framed as observed outcomes from testbed execution and replays rather than derived quantities.