arxiv: 2605.04555 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.SY· eess.SY

Recognition: unknown

Counter-Dyna: Data-Efficient RL-Based HVAC Control using Counterfactual Building Models

Jan Marco Ruiz de Vargas , Fabian Raisch , Zoltan Nagy , Pierre Pinson , Christoph Goebel

Authors on Pith no claims yet

Pith reviewed 2026-05-08 16:32 UTC · model grok-4.3

classification 💻 cs.LG cs.SYeess.SY

keywords HVAC controlreinforcement learningmodel-based RLdata-efficient learningcounterfactual modelsbuilding energy managementDyna algorithmBOPTEST

0 comments

The pith

Counterfactual surrogate models let Dyna-style RL train HVAC controllers with five weeks of building data instead of six to twelve months.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Counter-Dyna to improve the data efficiency of model-based reinforcement learning for building HVAC control. It replaces standard surrogate models with counterfactual ones that exploit state-space invariances to predict only the dynamics affected by control actions, leaving weather, prices, and other exogenous variables out of the learning loop. This change allows the Dyna algorithm to reach effective policies after far fewer real interactions with the environment. Prior methods needed six to twelve months of data; the new approach succeeds with five weeks while still producing 5.3 to 17 percent cost reductions in BOPTEST simulations. A reader would care because data collection in occupied buildings is costly and slow, so lowering the requirement makes reinforcement learning practical for everyday energy management.

Core claim

Counter-Dyna creates data-efficient counterfactual surrogate models (CSM) by leveraging invariances in the state-space. Using a CSM in Dyna speeds up RL training measured in environment interaction data compared to previous results. In comparison with previous state-of-the-art that used 6-12 months of environment interactions, our method needs only 5 weeks. We evaluate our method in a large simulation study using the literature standard BOPTEST framework and proximal policy algorithm (PPO) as the RL algorithm. Our results show cost-saving potentials of 5.3% to 17.0% in a hypothetical deployment scenario.

What carries the argument

Counterfactual surrogate models (CSM) that predict only controllable dynamics by exploiting invariances to exogenous state variables such as weather and electricity prices.

If this is right

Effective HVAC policies become trainable with roughly one-tenth the real-world interaction data previously required.
Energy cost reductions between 5.3 and 17 percent remain achievable under the reduced-data regime.
The approach works inside the standard BOPTEST simulation framework using the PPO algorithm.
Model-based RL for building control moves closer to real-world viability because data collection time drops from months to weeks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same invariance-leveraging trick could apply to other control tasks that mix controllable states with large exogenous signals, such as grid or traffic management.
Physical-building experiments would be required to check whether simulation-trained policies retain their performance once sensor noise and unmodeled dynamics appear.
Pairing the counterfactual models with transfer or meta-learning techniques might shrink the remaining data requirement even further.

Load-bearing premise

The counterfactual surrogate models accurately capture the controllable building dynamics without introducing bias that would degrade the policy when it is transferred to the real environment.

What would settle it

If a policy trained inside Counter-Dyna requires more than five weeks of interactions or produces lower cost savings than a policy trained directly on the real building in the same BOPTEST scenario, the data-efficiency claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.04555 by Christoph Goebel, Fabian Raisch, Jan Marco Ruiz de Vargas, Pierre Pinson, Zoltan Nagy.

**Figure 1.** Figure 1: Left: Flowchart illustrating the Dyna workflow. Right: Based on a point in the real episode (blue), the surrogate model view at source ↗

**Figure 2.** Figure 2: Causal diagram describing the causal relationships view at source ↗

**Figure 3.** Figure 3: Average episodic rewards. 1 week = 1 episode. Left: Counter-Dyna runs for 5 and 10 weeks. Right: Model-free runs for view at source ↗

**Figure 4.** Figure 4: Control performance of the best runs per category during the peak heating period. view at source ↗

**Figure 5.** Figure 5: Control performance of the best runs per category during the typical heating period. view at source ↗

**Figure 6.** Figure 6: Scatterplot of cost and discomfort for the best runs view at source ↗

**Figure 7.** Figure 7: PPO box plots of cost and discomfort for both testing periods, with distributions over the 30 different seeds. view at source ↗

**Figure 8.** Figure 8: Top: Comparing zone temperatures for in-sample view at source ↗

**Figure 10.** Figure 10: SAC box plots of cost and discomfort for both testing periods, with distributions over the 30 different seeds. view at source ↗

**Figure 11.** Figure 11: Learning curves for different synth:real data ratios. view at source ↗

**Figure 12.** Figure 12: Cost vs. discomfort when varying the synth:real view at source ↗

**Figure 15.** Figure 15: Cost vs. discomfort when varying the cost weight view at source ↗

read the original abstract

Model-based reinforcement learning (MBRL) offers a promising approach for data-efficient energy management in buildings, combining the strengths of predictive modeling and reinforcement learning. While previous MBRL methods applied to HVAC control have reduced training data requirements, they still require several months of interaction with the building to learn a satisfactory control policy. A key reason is that existing surrogate models attempt to predict the entire state-space, including weather and electricity prices that are unaffected by control actions, or completely ignore these variables. Addressing these issues, we propose Counter-Dyna, a method that enhances the data-efficiency of Dyna, an MBRL method. We create data-efficient counterfactual surrogate models (CSM) by leveraging invariances in the state-space. Using a CSM in Dyna speeds up RL training measured in environment interaction data compared to previous results. In comparison with previous state-of-the-art that used 6-12 months of environment interactions, our method needs only 5 weeks. We evaluate our method in a large simulation study using the literature standard BOPTEST framework and proximal policy algorithm (PPO) as the RL algorithm. Our results show cost-saving potentials of 5.3% to 17.0% in a hypothetical deployment scenario. Our work is a significant step towards making real-world deployment of RL algorithms in HVAC control practically viable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Counter-Dyna cuts simulated RL data needs for HVAC to five weeks by building counterfactual surrogates that model only action-affected dynamics, but the bias-free claim for those models is still untested outside BOPTEST.

read the letter

Counter-Dyna tries to make model-based RL practical for HVAC by building surrogate models that only predict the dynamics the controller can influence. The authors use invariances in the state space to create counterfactual targets that ignore exogenous inputs like weather and prices. In their BOPTEST simulations with PPO, this cuts the required real-world interaction data to five weeks while still achieving 5.3 to 17 percent cost savings.

Referee Report

2 major / 1 minor

Summary. The paper proposes Counter-Dyna, which augments the Dyna MBRL framework with counterfactual surrogate models (CSMs) that exploit state-space invariances to predict only action-affected variables while treating exogenous factors like weather and prices separately. Evaluated in BOPTEST simulations with PPO, the method is claimed to learn effective HVAC policies using only 5 weeks of environment interactions, versus 6-12 months in prior work, while delivering 5.3-17% cost savings.

Significance. If the CSMs prove unbiased for controllable dynamics, the result would meaningfully advance practical RL deployment for building control by lowering the data barrier. The choice of the standard BOPTEST benchmark and PPO is a positive for comparability and reproducibility.

major comments (2)

[Abstract] Abstract: the headline claim of 5-week data efficiency (versus 6-12 months) rests on CSMs producing unbiased rollouts, yet the manuscript supplies no quantitative validation of invariance assumptions, bias metrics, or sensitivity analysis for the counterfactual targets; this is load-bearing for the speedup assertion.
[Evaluation] Evaluation section: the reported 5.3-17% savings lack accompanying details on training/validation splits for the CSMs, number of independent runs, or statistical significance testing, making it impossible to assess whether the gains are robust or influenced by post-hoc modeling choices.

minor comments (1)

Clarify how 'environment interaction data' is counted and normalized when comparing against prior MBRL baselines that may use different state representations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point-by-point below and will revise the manuscript to incorporate the suggested improvements for greater clarity and rigor.

read point-by-point responses

Referee: [Abstract] Abstract: the headline claim of 5-week data efficiency (versus 6-12 months) rests on CSMs producing unbiased rollouts, yet the manuscript supplies no quantitative validation of invariance assumptions, bias metrics, or sensitivity analysis for the counterfactual targets; this is load-bearing for the speedup assertion.

Authors: We agree that the data-efficiency claim would be strengthened by explicit quantitative support for the CSM assumptions. The CSMs are constructed by design to predict only action-affected variables while treating exogenous factors (weather, prices) as separate inputs, which directly encodes the invariance. Nevertheless, we will add to the revised manuscript a dedicated analysis subsection with bias metrics (e.g., prediction error on controllable states versus ground-truth dynamics) and sensitivity tests that vary the set of assumed invariant variables. These additions will be placed in the Evaluation section to directly substantiate the unbiased-rollout premise. revision: yes
Referee: [Evaluation] Evaluation section: the reported 5.3-17% savings lack accompanying details on training/validation splits for the CSMs, number of independent runs, or statistical significance testing, making it impossible to assess whether the gains are robust or influenced by post-hoc modeling choices.

Authors: We concur that reproducibility and robustness assessment require these details. The original experiments used the first five weeks of BOPTEST data to train the CSMs with an 80/20 train/validation split and aggregated results across multiple random seeds, yet these specifics were not stated explicitly. In the revision we will report the exact split, the number of independent runs (ten), standard deviations or error bars on all metrics, and paired statistical tests (e.g., t-tests) against the baselines to confirm that the observed savings are statistically significant and not artifacts of modeling choices. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation independent of inputs

full rationale

The paper proposes CSMs by leveraging state-space invariances, integrates them into Dyna-style MBRL with PPO, and reports measured data-efficiency (5 weeks vs prior 6-12 months) and savings (5.3-17%) from direct experiments in the external BOPTEST simulator. No equations, fitted parameters, or self-citations are shown that reduce these outcomes to definitional equivalences or input data by construction. The invariance-based modeling choice is a design decision whose validity is tested externally rather than assumed tautologically. This is a standard empirical MBRL pipeline with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the existence of useful invariances in the building state space that allow construction of accurate counterfactual models without access to full exogenous trajectories. No free parameters or invented entities are quantified in the abstract.

axioms (1)

domain assumption The building dynamics can be decomposed into controllable and uncontrollable components that remain independent of control actions.
Invoked to justify the counterfactual surrogate model construction.

invented entities (1)

Counterfactual surrogate model (CSM) no independent evidence
purpose: Surrogate that predicts only the controllable part of the state for use inside Dyna-style MBRL.
New modeling construct introduced to achieve data efficiency.

pith-pipeline@v0.9.0 · 5557 in / 1288 out tokens · 27536 ms · 2026-05-08T16:32:21.559432+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 10 canonical work pages · 3 internal anchors

[1]

Thibaut Abergel, Brian Dean, and John Dulac. 2017. Towards a zero-emission, efficient, and resilient buildings and construction sector: Global Status Report 2017.UN Environment and International Energy Agency: Paris, France22 (2017)

2017
[2]

Khalil Al Sayed, Abhinandana Boodi, Roozbeh Sadeghian Broujeny, and Karim Beddiar. 2024. Reinforcement learning for HVAC control in intelligent buildings: A technical and conceptual review.Journal of Building Engineering(2024), 110085

2024
[3]

Javier Arroyo, Carlo Manna, Fred Spiessens, Lieve Helsen, D Saelens, J Laverge, W Boydens, and L Helsen. 2022. An OpenAI-gym environment for the building optimization testing (BOPTEST) framework. InProceedings of Building Simula- tion 2021: 17th Conference of IBPSA, Vol. 17. INT BUILDING PERFORMANCE SIMULATION ASSOC-IBPSA, 175–182

2022
[4]

David Blum, Javier Arroyo, Sen Huang, Ján Drgoňa, Filip Jorissen, Harald Taxt Walnum, Yan Chen, Kyle Benne, Draguna Vrabie, Michael Wetter, et al. 2021. Building optimization testing framework (BOPTEST) for simulation-based bench- marking of control strategies in buildings.Journal of Building Performance Simulation14, 5 (2021), 586–610

2021
[5]

Bingqing Chen, Zicheng Cai, and Mario Bergés. 2019. Gnu-RL: A Precocial Reinforcement Learning Solution for Building HVAC Control Using a Differ- entiable MPC Policy. InProceedings of the 6th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation(New York, NY, USA)(BuildSys ’19). Association for Computing Machin...

work page doi:10.1145/3360322.3360849 2019
[6]

Liangliang Chen, Fei Meng, and Ying Zhang. 2022. MBRL-MC: An HVAC control approach via combining model-based deep reinforcement learning and model predictive control.IEEE Internet of Things Journal9, 19 (2022), 19160–19173

2022
[7]

Xianzhong Ding, Zhiyu An, Arya Rathee, and Wan Du. 2025. A Safe and Data- Efficient Model-Based Reinforcement Learning System for HVAC Control.IEEE Internet of Things Journal12, 7 (2025), 8014–8032. doi:10.1109/JIOT.2025.3540402

work page doi:10.1109/jiot.2025.3540402 2025
[8]

Xianzhong Ding, Wan Du, and Alberto E Cerpa. 2020. Mb2c: Model-based deep reinforcement learning for multi-zone building control. InProceedings of the 7th ACM international conference on systems for energy-efficient buildings, cities, and transportation. 50–59

2020
[9]

Ján Drgoňa, Javier Arroyo, Iago Cupeiro Figueroa, David Blum, Krzysztof Arendt, Donghun Kim, Enric Perarnau Ollé, Juraj Oravec, Michael Wetter, Draguna L Vrabie, et al. 2020. All you need to know about model predictive control for buildings.Annual Reviews in Control50 (2020), 190–232

2020
[10]

Qiming Fu, Zhicong Han, Jianping Chen, You Lu, Hongjie Wu, and Yunzhe Wang
[11]

Applications of reinforcement learning for building energy efficiency control: A review.Journal of Building Engineering50 (2022), 104165

2022
[12]

Yangyang Fu, Shichao Xu, Qi Zhu, Zheng O’Neill, and Veronica Adetola. 2023. How good are learning-based control vs model-based control for load shifting? Investigations on a single zone building energy system.Energy273 (2023), 127073

2023
[13]

Cheng Gao and Dan Wang. 2023. Comparative study of model-based and model- free reinforcement learning control performance in HVAC systems.Journal of Building Engineering74 (2023), 106852

2023
[14]

Hao Gao, Christian Koch, and Yupeng Wu. 2019. Building information modelling based building energy modelling: A review.Applied energy238 (2019), 320–343

2019
[15]

Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Se- hoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. 2019. Soft Actor-Critic Algorithms and Applications. arXiv:1812.05905 [cs.LG] https://arxiv.org/abs/1812.05905

work page internal anchor Pith review arXiv 2019
[16]

Mengjie Han, Ross May, Xingxing Zhang, Xinru Wang, Song Pan, Da Yan, Yuan Jin, and Liguo Xu. 2019. A review of reinforcement learning methodologies for controlling occupant comfort in buildings.Sustainable cities and society51 (2019), 101748

2019
[17]

Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. 2019. When to trust your model: Model-based policy optimization.Advances in neural information processing systems32 (2019)

2019
[18]

Anjukan Kathirgamanathan, Eleni Mangina, and Donal P Finn. 2021. Develop- ment of a soft actor critic deep reinforcement learning approach for harnessing energy flexibility in a large office building.Energy and AI5 (2021), 100101

2021
[19]

Hsin-Yu Liu, Bharathan Balaji, Sicun Gao, Rajesh Gupta, and Dezhi Hong. 2022. Safe hvac control via batch reinforcement learning. In2022 ACM/IEEE 13th Inter- national Conference on Cyber-Physical Systems (ICCPS). IEEE, 181–192

2022
[20]

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning.arXiv preprint arXiv:1312.5602(2013)

work page internal anchor Pith review arXiv 2013
[21]

Zoltan Nagy, Gregor Henze, Sourav Dey, Javier Arroyo, Lieve Helsen, Xiangyu Zhang, Bingqing Chen, Kadir Amasyali, Kuldeep Kurte, Ahmed Zamzam, He- lia Zandi, Ján Drgoňa, Matias Quintana, Steven McCullogh, June Young Park, Han Li, Tianzhen Hong, Silvio Brandi, Giuseppe Pinto, Alfonso Capozzoli, Draguna Vrabie, Mario Bergés, Kingsley Nweye, Thibault Marzull...

work page doi:10.1016/j.buildenv.2023.110435 2023
[22]

2009.Causality

Judea Pearl. 2009.Causality. Cambridge university press

2009
[23]

Fabian Raisch, Thomas Krug, Christoph Goebel, and Benjamin Tischler. 2025. GenTL: A General Transfer Learning Model for Building Thermal Dynamics. In Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems (E-Energy ’25). Association for Computing Machinery, New York, NY, USA, 322–333. doi:10.1145/3679240.3734589

work page doi:10.1145/3679240.3734589 2025
[24]

Fabian Raisch, Max Langtry, Felix Koch, Ruchi Choudhary, Christoph Goebel, and Benjamin Tischler. 2026. Adapting to change: A comparison of continual and transfer learning for modeling building thermal dynamics under concept drifts.Energy and Buildings354 (2026), 116868. doi:10.1016/j.enbuild.2025.116868

work page doi:10.1016/j.enbuild.2025.116868 2026
[25]

Anil V Rao. 2009. A survey of numerical methods for optimal control.Advances in the astronautical Sciences135, 1 (2009), 497–528

2009
[26]

Muhammad Hafeez Saeed, Hussain Kazmi, and Geert Deconinck. 2024. Dyna- PINN: Physics-informed deep dyna-q reinforcement learning for intelligent con- trol of building heating system in low-diversity training data regimes.Energy and Buildings324 (2024), 114879

2024
[27]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov
[28]

Proximal Policy Optimization Algorithms

Proximal Policy Optimization Algorithms. arXiv:1707.06347 [cs.LG] https: //arxiv.org/abs/1707.06347

work page internal anchor Pith review arXiv
[29]

Francesco Smarra, Achin Jain, Tullio de Rubeis, Dario Ambrosini, Alessandro D’Innocenzo, and Rahul Mangharam. 2018. Data-driven model predictive con- trol using random forests for building energy optimization and climate control. Applied Energy226 (2018), 1252–1272. doi:10.1016/j.apenergy.2018.02.126

work page doi:10.1016/j.apenergy.2018.02.126 2018
[30]

Richard S Sutton. 1991. Dyna, an integrated architecture for learning, planning, and reacting.ACM Sigart Bulletin2, 4 (1991), 160–163

1991
[31]

José R Vázquez-Canteli and Zoltán Nagy. 2019. Reinforcement learning for demand response: A review of algorithms and modeling techniques.Applied energy235 (2019), 1072–1089

2019
[32]

Dan Wang, Wanfu Zheng, Zhe Wang, Yaran Wang, Xiufeng Pang, and Wei Wang
[33]

Comparison of reinforcement learning and model predictive control for building energy system optimization.Applied Thermal Engineering228 (2023), 120430

2023
[34]

Xiangwei Wang, Peng Wang, Renke Huang, Xiuli Zhu, Javier Arroyo, and Ning Li. 2025. Safe deep reinforcement learning for building energy management. Applied Energy377 (2025), 124328. doi:10.1016/j.apenergy.2024.124328

work page doi:10.1016/j.apenergy.2024.124328 2025
[35]

Zhe Wang and Tianzhen Hong. 2020. Reinforcement learning for building con- trols: The opportunities and challenges.Applied Energy269 (2020), 115036

2020
[36]

Tianshu Wei, Yanzhi Wang, and Qi Zhu. 2017. Deep reinforcement learning for building HVAC control. InProceedings of the 54th annual design automation conference 2017. 1–6

2017
[37]

Liang Yu, Shuqi Qin, Meng Zhang, Chao Shen, Tao Jiang, and Xiaohong Guan
[38]

A review of deep reinforcement learning for smart building energy man- agement.IEEE Internet of Things Journal8, 15 (2021), 12046–12063

2021
[39]

Chi Zhang, Sanmukh Rao Kuppannagari, and Viktor K Prasanna. 2022. Safe building HVAC control via batch reinforcement learning.IEEE Transactions on Sustainable Computing7, 4 (2022), 923–934. A Comparison of Counter-Dyna and Model-free Soft-Actor-Critic with Continuous Actions To show our method for more than one RL algorithm and a dif- ferent action-space,...

2022