Recognition: unknown
Counter-Dyna: Data-Efficient RL-Based HVAC Control using Counterfactual Building Models
Pith reviewed 2026-05-08 16:32 UTC · model grok-4.3
The pith
Counterfactual surrogate models let Dyna-style RL train HVAC controllers with five weeks of building data instead of six to twelve months.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Counter-Dyna creates data-efficient counterfactual surrogate models (CSM) by leveraging invariances in the state-space. Using a CSM in Dyna speeds up RL training measured in environment interaction data compared to previous results. In comparison with previous state-of-the-art that used 6-12 months of environment interactions, our method needs only 5 weeks. We evaluate our method in a large simulation study using the literature standard BOPTEST framework and proximal policy algorithm (PPO) as the RL algorithm. Our results show cost-saving potentials of 5.3% to 17.0% in a hypothetical deployment scenario.
What carries the argument
Counterfactual surrogate models (CSM) that predict only controllable dynamics by exploiting invariances to exogenous state variables such as weather and electricity prices.
If this is right
- Effective HVAC policies become trainable with roughly one-tenth the real-world interaction data previously required.
- Energy cost reductions between 5.3 and 17 percent remain achievable under the reduced-data regime.
- The approach works inside the standard BOPTEST simulation framework using the PPO algorithm.
- Model-based RL for building control moves closer to real-world viability because data collection time drops from months to weeks.
Where Pith is reading between the lines
- The same invariance-leveraging trick could apply to other control tasks that mix controllable states with large exogenous signals, such as grid or traffic management.
- Physical-building experiments would be required to check whether simulation-trained policies retain their performance once sensor noise and unmodeled dynamics appear.
- Pairing the counterfactual models with transfer or meta-learning techniques might shrink the remaining data requirement even further.
Load-bearing premise
The counterfactual surrogate models accurately capture the controllable building dynamics without introducing bias that would degrade the policy when it is transferred to the real environment.
What would settle it
If a policy trained inside Counter-Dyna requires more than five weeks of interactions or produces lower cost savings than a policy trained directly on the real building in the same BOPTEST scenario, the data-efficiency claim would be falsified.
Figures
read the original abstract
Model-based reinforcement learning (MBRL) offers a promising approach for data-efficient energy management in buildings, combining the strengths of predictive modeling and reinforcement learning. While previous MBRL methods applied to HVAC control have reduced training data requirements, they still require several months of interaction with the building to learn a satisfactory control policy. A key reason is that existing surrogate models attempt to predict the entire state-space, including weather and electricity prices that are unaffected by control actions, or completely ignore these variables. Addressing these issues, we propose Counter-Dyna, a method that enhances the data-efficiency of Dyna, an MBRL method. We create data-efficient counterfactual surrogate models (CSM) by leveraging invariances in the state-space. Using a CSM in Dyna speeds up RL training measured in environment interaction data compared to previous results. In comparison with previous state-of-the-art that used 6-12 months of environment interactions, our method needs only 5 weeks. We evaluate our method in a large simulation study using the literature standard BOPTEST framework and proximal policy algorithm (PPO) as the RL algorithm. Our results show cost-saving potentials of 5.3% to 17.0% in a hypothetical deployment scenario. Our work is a significant step towards making real-world deployment of RL algorithms in HVAC control practically viable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Counter-Dyna, which augments the Dyna MBRL framework with counterfactual surrogate models (CSMs) that exploit state-space invariances to predict only action-affected variables while treating exogenous factors like weather and prices separately. Evaluated in BOPTEST simulations with PPO, the method is claimed to learn effective HVAC policies using only 5 weeks of environment interactions, versus 6-12 months in prior work, while delivering 5.3-17% cost savings.
Significance. If the CSMs prove unbiased for controllable dynamics, the result would meaningfully advance practical RL deployment for building control by lowering the data barrier. The choice of the standard BOPTEST benchmark and PPO is a positive for comparability and reproducibility.
major comments (2)
- [Abstract] Abstract: the headline claim of 5-week data efficiency (versus 6-12 months) rests on CSMs producing unbiased rollouts, yet the manuscript supplies no quantitative validation of invariance assumptions, bias metrics, or sensitivity analysis for the counterfactual targets; this is load-bearing for the speedup assertion.
- [Evaluation] Evaluation section: the reported 5.3-17% savings lack accompanying details on training/validation splits for the CSMs, number of independent runs, or statistical significance testing, making it impossible to assess whether the gains are robust or influenced by post-hoc modeling choices.
minor comments (1)
- Clarify how 'environment interaction data' is counted and normalized when comparing against prior MBRL baselines that may use different state representations.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment point-by-point below and will revise the manuscript to incorporate the suggested improvements for greater clarity and rigor.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline claim of 5-week data efficiency (versus 6-12 months) rests on CSMs producing unbiased rollouts, yet the manuscript supplies no quantitative validation of invariance assumptions, bias metrics, or sensitivity analysis for the counterfactual targets; this is load-bearing for the speedup assertion.
Authors: We agree that the data-efficiency claim would be strengthened by explicit quantitative support for the CSM assumptions. The CSMs are constructed by design to predict only action-affected variables while treating exogenous factors (weather, prices) as separate inputs, which directly encodes the invariance. Nevertheless, we will add to the revised manuscript a dedicated analysis subsection with bias metrics (e.g., prediction error on controllable states versus ground-truth dynamics) and sensitivity tests that vary the set of assumed invariant variables. These additions will be placed in the Evaluation section to directly substantiate the unbiased-rollout premise. revision: yes
-
Referee: [Evaluation] Evaluation section: the reported 5.3-17% savings lack accompanying details on training/validation splits for the CSMs, number of independent runs, or statistical significance testing, making it impossible to assess whether the gains are robust or influenced by post-hoc modeling choices.
Authors: We concur that reproducibility and robustness assessment require these details. The original experiments used the first five weeks of BOPTEST data to train the CSMs with an 80/20 train/validation split and aggregated results across multiple random seeds, yet these specifics were not stated explicitly. In the revision we will report the exact split, the number of independent runs (ten), standard deviations or error bars on all metrics, and paired statistical tests (e.g., t-tests) against the baselines to confirm that the observed savings are statistically significant and not artifacts of modeling choices. revision: yes
Circularity Check
No significant circularity; empirical evaluation independent of inputs
full rationale
The paper proposes CSMs by leveraging state-space invariances, integrates them into Dyna-style MBRL with PPO, and reports measured data-efficiency (5 weeks vs prior 6-12 months) and savings (5.3-17%) from direct experiments in the external BOPTEST simulator. No equations, fitted parameters, or self-citations are shown that reduce these outcomes to definitional equivalences or input data by construction. The invariance-based modeling choice is a design decision whose validity is tested externally rather than assumed tautologically. This is a standard empirical MBRL pipeline with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The building dynamics can be decomposed into controllable and uncontrollable components that remain independent of control actions.
invented entities (1)
-
Counterfactual surrogate model (CSM)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Thibaut Abergel, Brian Dean, and John Dulac. 2017. Towards a zero-emission, efficient, and resilient buildings and construction sector: Global Status Report 2017.UN Environment and International Energy Agency: Paris, France22 (2017)
2017
-
[2]
Khalil Al Sayed, Abhinandana Boodi, Roozbeh Sadeghian Broujeny, and Karim Beddiar. 2024. Reinforcement learning for HVAC control in intelligent buildings: A technical and conceptual review.Journal of Building Engineering(2024), 110085
2024
-
[3]
Javier Arroyo, Carlo Manna, Fred Spiessens, Lieve Helsen, D Saelens, J Laverge, W Boydens, and L Helsen. 2022. An OpenAI-gym environment for the building optimization testing (BOPTEST) framework. InProceedings of Building Simula- tion 2021: 17th Conference of IBPSA, Vol. 17. INT BUILDING PERFORMANCE SIMULATION ASSOC-IBPSA, 175–182
2022
-
[4]
David Blum, Javier Arroyo, Sen Huang, Ján Drgoňa, Filip Jorissen, Harald Taxt Walnum, Yan Chen, Kyle Benne, Draguna Vrabie, Michael Wetter, et al. 2021. Building optimization testing framework (BOPTEST) for simulation-based bench- marking of control strategies in buildings.Journal of Building Performance Simulation14, 5 (2021), 586–610
2021
-
[5]
Bingqing Chen, Zicheng Cai, and Mario Bergés. 2019. Gnu-RL: A Precocial Reinforcement Learning Solution for Building HVAC Control Using a Differ- entiable MPC Policy. InProceedings of the 6th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation(New York, NY, USA)(BuildSys ’19). Association for Computing Machin...
-
[6]
Liangliang Chen, Fei Meng, and Ying Zhang. 2022. MBRL-MC: An HVAC control approach via combining model-based deep reinforcement learning and model predictive control.IEEE Internet of Things Journal9, 19 (2022), 19160–19173
2022
-
[7]
Xianzhong Ding, Zhiyu An, Arya Rathee, and Wan Du. 2025. A Safe and Data- Efficient Model-Based Reinforcement Learning System for HVAC Control.IEEE Internet of Things Journal12, 7 (2025), 8014–8032. doi:10.1109/JIOT.2025.3540402
-
[8]
Xianzhong Ding, Wan Du, and Alberto E Cerpa. 2020. Mb2c: Model-based deep reinforcement learning for multi-zone building control. InProceedings of the 7th ACM international conference on systems for energy-efficient buildings, cities, and transportation. 50–59
2020
-
[9]
Ján Drgoňa, Javier Arroyo, Iago Cupeiro Figueroa, David Blum, Krzysztof Arendt, Donghun Kim, Enric Perarnau Ollé, Juraj Oravec, Michael Wetter, Draguna L Vrabie, et al. 2020. All you need to know about model predictive control for buildings.Annual Reviews in Control50 (2020), 190–232
2020
-
[10]
Qiming Fu, Zhicong Han, Jianping Chen, You Lu, Hongjie Wu, and Yunzhe Wang
-
[11]
Applications of reinforcement learning for building energy efficiency control: A review.Journal of Building Engineering50 (2022), 104165
2022
-
[12]
Yangyang Fu, Shichao Xu, Qi Zhu, Zheng O’Neill, and Veronica Adetola. 2023. How good are learning-based control vs model-based control for load shifting? Investigations on a single zone building energy system.Energy273 (2023), 127073
2023
-
[13]
Cheng Gao and Dan Wang. 2023. Comparative study of model-based and model- free reinforcement learning control performance in HVAC systems.Journal of Building Engineering74 (2023), 106852
2023
-
[14]
Hao Gao, Christian Koch, and Yupeng Wu. 2019. Building information modelling based building energy modelling: A review.Applied energy238 (2019), 320–343
2019
-
[15]
Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Se- hoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. 2019. Soft Actor-Critic Algorithms and Applications. arXiv:1812.05905 [cs.LG] https://arxiv.org/abs/1812.05905
work page internal anchor Pith review arXiv 2019
-
[16]
Mengjie Han, Ross May, Xingxing Zhang, Xinru Wang, Song Pan, Da Yan, Yuan Jin, and Liguo Xu. 2019. A review of reinforcement learning methodologies for controlling occupant comfort in buildings.Sustainable cities and society51 (2019), 101748
2019
-
[17]
Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. 2019. When to trust your model: Model-based policy optimization.Advances in neural information processing systems32 (2019)
2019
-
[18]
Anjukan Kathirgamanathan, Eleni Mangina, and Donal P Finn. 2021. Develop- ment of a soft actor critic deep reinforcement learning approach for harnessing energy flexibility in a large office building.Energy and AI5 (2021), 100101
2021
-
[19]
Hsin-Yu Liu, Bharathan Balaji, Sicun Gao, Rajesh Gupta, and Dezhi Hong. 2022. Safe hvac control via batch reinforcement learning. In2022 ACM/IEEE 13th Inter- national Conference on Cyber-Physical Systems (ICCPS). IEEE, 181–192
2022
-
[20]
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning.arXiv preprint arXiv:1312.5602(2013)
work page internal anchor Pith review arXiv 2013
-
[21]
Zoltan Nagy, Gregor Henze, Sourav Dey, Javier Arroyo, Lieve Helsen, Xiangyu Zhang, Bingqing Chen, Kadir Amasyali, Kuldeep Kurte, Ahmed Zamzam, He- lia Zandi, Ján Drgoňa, Matias Quintana, Steven McCullogh, June Young Park, Han Li, Tianzhen Hong, Silvio Brandi, Giuseppe Pinto, Alfonso Capozzoli, Draguna Vrabie, Mario Bergés, Kingsley Nweye, Thibault Marzull...
-
[22]
2009.Causality
Judea Pearl. 2009.Causality. Cambridge university press
2009
-
[23]
Fabian Raisch, Thomas Krug, Christoph Goebel, and Benjamin Tischler. 2025. GenTL: A General Transfer Learning Model for Building Thermal Dynamics. In Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems (E-Energy ’25). Association for Computing Machinery, New York, NY, USA, 322–333. doi:10.1145/3679240.3734589
-
[24]
Fabian Raisch, Max Langtry, Felix Koch, Ruchi Choudhary, Christoph Goebel, and Benjamin Tischler. 2026. Adapting to change: A comparison of continual and transfer learning for modeling building thermal dynamics under concept drifts.Energy and Buildings354 (2026), 116868. doi:10.1016/j.enbuild.2025.116868
-
[25]
Anil V Rao. 2009. A survey of numerical methods for optimal control.Advances in the astronautical Sciences135, 1 (2009), 497–528
2009
-
[26]
Muhammad Hafeez Saeed, Hussain Kazmi, and Geert Deconinck. 2024. Dyna- PINN: Physics-informed deep dyna-q reinforcement learning for intelligent con- trol of building heating system in low-diversity training data regimes.Energy and Buildings324 (2024), 114879
2024
-
[27]
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov
-
[28]
Proximal Policy Optimization Algorithms
Proximal Policy Optimization Algorithms. arXiv:1707.06347 [cs.LG] https: //arxiv.org/abs/1707.06347
work page internal anchor Pith review arXiv
-
[29]
Francesco Smarra, Achin Jain, Tullio de Rubeis, Dario Ambrosini, Alessandro D’Innocenzo, and Rahul Mangharam. 2018. Data-driven model predictive con- trol using random forests for building energy optimization and climate control. Applied Energy226 (2018), 1252–1272. doi:10.1016/j.apenergy.2018.02.126
-
[30]
Richard S Sutton. 1991. Dyna, an integrated architecture for learning, planning, and reacting.ACM Sigart Bulletin2, 4 (1991), 160–163
1991
-
[31]
José R Vázquez-Canteli and Zoltán Nagy. 2019. Reinforcement learning for demand response: A review of algorithms and modeling techniques.Applied energy235 (2019), 1072–1089
2019
-
[32]
Dan Wang, Wanfu Zheng, Zhe Wang, Yaran Wang, Xiufeng Pang, and Wei Wang
-
[33]
Comparison of reinforcement learning and model predictive control for building energy system optimization.Applied Thermal Engineering228 (2023), 120430
2023
-
[34]
Xiangwei Wang, Peng Wang, Renke Huang, Xiuli Zhu, Javier Arroyo, and Ning Li. 2025. Safe deep reinforcement learning for building energy management. Applied Energy377 (2025), 124328. doi:10.1016/j.apenergy.2024.124328
-
[35]
Zhe Wang and Tianzhen Hong. 2020. Reinforcement learning for building con- trols: The opportunities and challenges.Applied Energy269 (2020), 115036
2020
-
[36]
Tianshu Wei, Yanzhi Wang, and Qi Zhu. 2017. Deep reinforcement learning for building HVAC control. InProceedings of the 54th annual design automation conference 2017. 1–6
2017
-
[37]
Liang Yu, Shuqi Qin, Meng Zhang, Chao Shen, Tao Jiang, and Xiaohong Guan
-
[38]
A review of deep reinforcement learning for smart building energy man- agement.IEEE Internet of Things Journal8, 15 (2021), 12046–12063
2021
-
[39]
Chi Zhang, Sanmukh Rao Kuppannagari, and Viktor K Prasanna. 2022. Safe building HVAC control via batch reinforcement learning.IEEE Transactions on Sustainable Computing7, 4 (2022), 923–934. A Comparison of Counter-Dyna and Model-free Soft-Actor-Critic with Continuous Actions To show our method for more than one RL algorithm and a dif- ferent action-space,...
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.