arxiv: 2605.05088 · v1 · submitted 2026-05-06 · 💻 cs.LG · physics.soc-ph

Gated Multimodal Learning for Interpretable Property Energy Performance Prediction and Retrofit Scenario Analysis

Aaron Tesfa Tsion, Barbara Shollock, Raul Rosales, Wei He, Yunfei Bai

Pith reviewed 2026-05-08 16:19 UTC · model grok-4.3

keywords: gated multimodal learning · energy performance certificates · SAP scores · building retrofit analysis · interpretability · GIS spatial features · neural network fusion

The pith

A gated multimodal model predicts building energy efficiency scores more accurately by learning property-specific weights for tabular data, assessor text, and spatial features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a neural network that fuses three sources of building information to forecast Standard Assessment Procedure energy efficiency scores and Environmental Impact scores. It uses a gating layer to assign different importance to each data type for every individual property, plus an auxiliary task that classifies the score into bands to help training. In the Westminster case study the combined model reduces prediction error compared with versions that use only one or two of the data sources. The same architecture produces explanations through feature attribution methods and runs what-if calculations for adding wall insulation, roof insulation, or new windows. These outputs are intended to support faster screening of retrofit options across many properties without requiring an on-site visit for each one.

Core claim

The gated multimodal network that learns per-property weights over EPC tabular fields, assessor free text, and GIS-derived geometry achieves mean absolute errors of 4.03 on SAP scores and 4.76 on EI scores with R-squared values of 0.757 and 0.748; full fusion of all three modalities outperforms any unimodal or bimodal ablation, while the auxiliary band-classification head stabilizes regression training.
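For readers less familiar with the two headline metrics, the following is a minimal sketch of how mean absolute error and R-squared are computed. The scores in the example are made up for illustration and are not taken from the paper; the function names are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between actual and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of variance in the actual scores explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative SAP scores only (not data from the paper).
actual = [72.0, 65.0, 58.0, 80.0]
predicted = [70.0, 68.0, 55.0, 79.0]

mae = mean_absolute_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(f"MAE = {mae:.2f} points, R-squared = {r2:.3f}")
```

Read this way, an MAE of 4.03 means a typical prediction misses the assessed SAP score by about four points on the 1 to 100 scale.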

What carries the argument

Sample-wise gating mechanism that produces a set of modality weights for each individual property, allowing the model to emphasize text, tables, or spatial features differently depending on the building.
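The idea can be sketched in a few lines. This page does not give the paper's exact gate equations, so the form below (a linear layer over the concatenated embeddings followed by a softmax, then a weighted sum) is an assumption, and `gated_fusion`, `gate_weights`, and `gate_bias` are hypothetical names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(z_tab, z_text, z_spatial, gate_weights, gate_bias):
    """Fuse three d-dimensional embeddings with one weight per modality.

    Assumed gate form: logits = W [z_tab; z_text; z_spatial] + b, then softmax.
    Because the logits depend on the embeddings, each property gets its own
    set of modality weights.
    """
    concat = z_tab + z_text + z_spatial  # list concatenation, length 3d
    logits = [sum(w * x for w, x in zip(row, concat)) + b
              for row, b in zip(gate_weights, gate_bias)]
    alphas = softmax(logits)
    fused = [alphas[0] * t + alphas[1] * x + alphas[2] * s
             for t, x, s in zip(z_tab, z_text, z_spatial)]
    return alphas, fused


# Toy example with d = 2; the weights and bias are illustrative, not learned.
z_tab, z_text, z_spatial = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
gate_weights = [[0.0] * 6 for _ in range(3)]
gate_bias = [0.0, 1.0, 0.5]

alphas, fused = gated_fusion(z_tab, z_text, z_spatial, gate_weights, gate_bias)
print("modality weights (tabular, text, spatial):", alphas)
print("fused embedding:", fused)
```

In a trained model the gate weights would be learned jointly with the encoders, so the weights would differ from one property to the next rather than being fixed by the bias as in this toy case.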

If this is right

City-scale energy assessments become feasible using existing certificate databases and public maps instead of new physical inspections.
Gating weights, SHAP values, text occlusion scores, and spatial attribution maps together indicate which building attributes most influence each predicted score.
Scenario runs for wall, roof, and glazing upgrades produce concrete estimates of resulting changes in annual energy cost and equivalent CO2 emissions.
The auxiliary band-classification objective improves the numerical stability of the continuous score predictions.
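The auxiliary objective in the last point can be illustrated with a short sketch. The band thresholds are the standard EPC rating bands (A for 92 and above, down to G for 1 to 20); that the paper uses exactly these bands, the L1 regression term, and the `aux_weight` value of 0.5 are all assumptions, since this page does not state the combined loss.

```python
import math

# Standard EPC rating bands, given as (band, lowest score in the band).
BAND_LOWER_BOUNDS = [("A", 92), ("B", 81), ("C", 69), ("D", 55),
                     ("E", 39), ("F", 21), ("G", 1)]


def score_to_band(score):
    """Map a 1-100 SAP or EI score to its letter band."""
    for band, lower in BAND_LOWER_BOUNDS:
        if score >= lower:
            return band
    return "G"  # scores below 1 are treated as the lowest band


def combined_loss(pred_score, true_score, band_probs, aux_weight=0.5):
    """Regression loss plus a weighted band cross-entropy (assumed form).

    band_probs maps each band letter to the classification head's
    predicted probability for that band.
    """
    regression = abs(pred_score - true_score)
    true_band = score_to_band(true_score)
    cross_entropy = -math.log(band_probs[true_band])
    return regression + aux_weight * cross_entropy


# Illustrative call: the head puts 70% of its probability on the true band C.
probs = {"A": 0.0, "B": 0.05, "C": 0.7, "D": 0.2, "E": 0.05, "F": 0.0, "G": 0.0}
loss = combined_loss(pred_score=70.0, true_score=72.0, band_probs=probs)
print(score_to_band(72.0), round(loss, 4))
```

The classification term penalises predictions that land in the wrong band even when the numerical error is small, which is one plausible reading of how the auxiliary head steadies the regression.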

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same gating approach could be retrained on data from other regions provided the input fields remain comparable in format and coverage.
The per-property explanations might be used to rank buildings by retrofit urgency before sending inspectors.
Linking the model to metered consumption records could supply an additional supervision signal if such data are obtained.

Load-bearing premise

The patterns learned from one London borough will hold for buildings elsewhere and the post-hoc attribution methods will correctly identify the features that actually drive the true energy performance.

What would settle it

Train the model on the Westminster data and then evaluate its mean absolute error and feature-ranking stability on a fresh set of EPC records and GIS maps from a different UK city or borough.

Figures

Figures reproduced from arXiv: 2605.05088 by Aaron Tesfa Tsion, Barbara Shollock, Raul Rosales, Wei He, Yunfei Bai.

**Figure 1.** GIS layers of the Westminster study area: (a) Westminster boundary overlaid on the base map with Lower Layer…

**Figure 2.** Multimodal model framework. where M denotes the number of categorical fields. Numerical features are encoded using a two-layer multilayer perceptron (MLP), producing the numerical representation n. The concatenated representation [c, n] is then projected into a shared latent space of dimension d, yielding the tabular embedding $Z_{\mathrm{tab}} \in \mathbb{R}^{d}$ (Eq. 7). For the text modality, a Transformer encoder pre-trained on lar…

**Figure 3.** Jointly stratified training, validation, and test sets based on property type, SAP band, and EI band. The model is trained end-to-end using the Adam optimiser [29] with layer-wise learning rates to accommodate heterogeneous initialisation scales: the pre-trained Transformer backbone uses 1 × 10⁻⁵, its projection layers use 1 × 10⁻⁴, and all remaining modules use 1 × 10⁻³. A validation-loss-based scheduler is applied to halve the learning rate when no improvement is observed for five consecutive epochs.…

**Figure 4.** Training loss and validation metrics. 4.3. Multimodal Ablation Study. To systematically evaluate the contribution of each modality in the proposed multimodal framework, a modality ablation study was conducted on the test set. Seven model configurations were evaluated, including three single-modality models (Tabular, Text, and Spatial), three dual-modality models (Tabular+Text, Tabular+Spatial, and Text+Spat…

**Figure 5.** Results of modality ablation: (a) Modality comparison in terms of band classification accuracy and R…

**Figure 6.** Predicted versus actual SAP and EI regression results by property type: (a–c) SAP predictions for Flats, Houses,…

**Figure 7.** Prediction results of SAP and EI scores and bands across different built forms: the left figure shows the R…

**Figure 8.** Prediction results of SAP and EI scores and bands across construction age band: the left figure shows the R…

**Figure 9.** Figure 9 presents the distribution of sample-wise fusion weights produced by the gated fusion mechanism. Overall, the model relies predominantly on the text modality when predicting SAP and EI, followed by the spatial modality, while the tabular modality contributes the least. The dominance of the text modality can be attributed to the nature of the information it encodes. EPC textual fields provide fine-grained de…

**Figure 10.** Tabular feature importance analysis: (a) SAP; (b) EI.

**Figure 11.** Figure 11 presents the overall results. For both SAP and EI, roof description emerges as the most influential textual feature, followed by wall description, while heating system description contributes the most among system-related fields. From a physical perspective, fabric-related elements such as roofs and walls directly govern heating demand, energy consumption, and associated costs and emissions…

**Figure 12.** Spatial numerical feature importance.

**Figure 13.** Spatial boundary shape information importance analysis.

**Figure 14.** Point-level saliency of boundary. 6. Scenario-Based Retrofit Analysis in Westminster. To demonstrate the practical utility of the proposed framework, three retrofit scenarios were evaluated for properties within Westminster: wall insulation, roof insulation, and window glazing upgrades. Across the study area, 100,701 properties were identified as requiring wall insulation, 22,082 as requiring roof insulati…

**Figure 15.** Spatial distribution of projected retrofit benefits in Westminster under three intervention scenarios. Panels (a)–(c)…

**Figure 16.** Projected annual energy cost and equivalent CO…

Achieving resilient and sustainable cities requires scalable approaches to decarbonising residential buildings, which account for about 20% of UK greenhouse gas emissions and 25% of energy-related emissions in the European Union. Energy Performance Certificates (EPCs) support regulation and retrofit planning, but their reliance on on-site inspections limits timely city-scale assessment. This study introduces a gated multimodal model to predict Standard Assessment Procedure (SAP) energy efficiency and Environmental Impact (EI) scores by integrating EPC tabular variables, assessor-written free text, and Geographic Information System (GIS)-derived spatial features describing footprint geometry, height, area, and orientation. Sample-wise gating learns property-specific modality weights, while an auxiliary band classification head stabilises training. In a Westminster, London case study, the model predicts SAP and EI scores with MAEs of 4.03 and 4.76 points and R2 values of 0.757 and 0.748, respectively, achieving a mean MAE of 4.39. Ablation results show that full multimodal fusion outperforms unimodal and bimodal baselines for both score prediction and band-level classification. Interpretability analyses provide decision-relevant evidence: gating weights indicate strong reliance on assessor text; SHAP highlights main fuel, built form, and construction age band; text occlusion prioritises roof and wall fields; and spatial attribution is dominated by height and footprint area, with sensitivity to footprint shape. The validated framework is further applied to retrofit scenarios for wall insulation, roof insulation, and window glazing upgrades, indicating projected improvements in SAP, EI, annual energy cost, and equivalent CO2 emissions. Overall, the framework provides scalable property-level evidence for retrofit screening, intervention prioritisation, and net-zero housing transitions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The gated multimodal model delivers usable SAP/EI predictions and retrofit analysis on one London borough, but the single-dataset setup leaves generalization and interpretability claims unproven.

The paper shows a gated multimodal network that combines EPC tabular data, assessor text, and GIS spatial features to predict SAP and EI scores, with sample-wise gating weights, an auxiliary band classification head, and post-hoc interpretability via SHAP, text occlusion, and spatial attribution. On the Westminster case study it reports MAEs of 4.03 and 4.76 with R² values of 0.757 and 0.748, plus ablation gains for the full fusion setup, and then applies the model to wall, roof, and window retrofit scenarios to estimate changes in scores, costs, and emissions.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a gated multimodal neural network that integrates EPC tabular features, assessor free-text descriptions, and GIS-derived spatial attributes (footprint geometry, height, area, orientation) to predict SAP energy efficiency and EI scores. On a Westminster, London dataset it reports MAEs of 4.03 (SAP) and 4.76 (EI) with R² values 0.757 and 0.748, shows full multimodal fusion outperforming unimodal/bimodal baselines in both regression and band classification, supplies interpretability via sample-wise gating weights, SHAP, text occlusion and spatial attribution, and demonstrates the model on retrofit scenarios for wall/roof insulation and window upgrades.

Significance. If the reported performance and interpretability results prove robust, the work supplies a practical, scalable route to city-scale energy-performance screening that bypasses on-site inspections, directly supporting retrofit prioritisation and net-zero planning. Concrete numerical results, systematic ablations, and multiple post-hoc interpretability techniques constitute clear strengths; the retrofit scenario analysis further illustrates decision-relevant utility.

major comments (2)

[§4 and §5] §4 (Experimental Setup) and §5 (Results): the central performance claims (MAE 4.03/4.76, R² 0.757/0.748, ablation superiority) and the interpretability conclusions rest on a single Westminster dataset with no cross-regional, temporal, or multi-city hold-out evaluation. Because EPC tabular, textual and spatial features exhibit well-documented regional biases in assessor practice and building stock, the absence of such testing leaves open whether the multimodal gains and feature attributions generalise or are dataset-specific correlations.
[§4.2] §4.2 (Data and Preprocessing): no information is provided on train/validation/test splits, cross-validation procedure, handling of missing values, or statistical error bars on the reported MAEs and R² values. Without these details it is impossible to determine whether the ablation improvements are statistically reliable or sensitive to post-hoc partitioning choices.

minor comments (2)

[Abstract] The abstract states a 'mean MAE of 4.39' without clarifying whether this is a simple average of the two task MAEs or a weighted combination; a brief clarification would aid reproducibility.
[§3.2] Notation for the gating weights and auxiliary loss coefficients is introduced without an explicit equation reference; adding a numbered equation for the combined loss would improve clarity.
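On the first minor comment, a quick arithmetic check is possible from the reported figures alone. The sketch below assumes only that 4.03 and 4.76 are rounded to two decimals; it does not establish which averaging the authors actually used.

```python
# Reported task-level MAEs, rounded to two decimals in the abstract.
sap_mae = 4.03
ei_mae = 4.76

# Simple (unweighted) average of the rounded values.
simple_mean = (sap_mae + ei_mae) / 2

# Each rounded value could hide a true value up to 0.005 either side,
# so the unweighted mean of the unrounded MAEs lies in this interval.
lowest_mean = ((sap_mae - 0.005) + (ei_mae - 0.005)) / 2
highest_mean = ((sap_mae + 0.005) + (ei_mae + 0.005)) / 2

print(f"simple mean of rounded values: {simple_mean:.4f}")
print(f"possible range of unweighted mean: {lowest_mean:.3f} to {highest_mean:.3f}")
```

The reported 4.39 sits inside that range, so it is consistent with a plain average of the two task MAEs, although a weighted combination cannot be ruled out from these numbers.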

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of experimental rigor and generalizability that we address point by point below. Where revisions are feasible, we have updated the manuscript; we also note limitations that cannot be fully resolved without additional data.

Referee: [§4 and §5] §4 (Experimental Setup) and §5 (Results): the central performance claims (MAE 4.03/4.76, R² 0.757/0.748, ablation superiority) and the interpretability conclusions rest on a single Westminster dataset with no cross-regional, temporal, or multi-city hold-out evaluation. Because EPC tabular, textual and spatial features exhibit well-documented regional biases in assessor practice and building stock, the absence of such testing leaves open whether the multimodal gains and feature attributions generalise or are dataset-specific correlations.

Authors: We agree that reliance on a single-city dataset (Westminster, London) limits strong claims of broad generalizability, particularly given known regional variations in EPC assessor practices and building stock characteristics. The current work is presented as a detailed case study demonstrating the gated multimodal approach on a rich, high-quality multimodal dataset. In the revised manuscript we have added an expanded Limitations and Future Work section that explicitly discusses potential regional biases, the role of interpretability methods (gating weights, SHAP, text occlusion, spatial attribution) in surfacing dataset-specific drivers, and concrete plans for multi-city and temporal validation using additional UK EPC releases. We cannot, however, perform new cross-regional experiments in this revision as we do not have access to equivalent multimodal EPC-GIS-text datasets from other regions.

revision: partial

Referee: [§4.2] §4.2 (Data and Preprocessing): no information is provided on train/validation/test splits, cross-validation procedure, handling of missing values, or statistical error bars on the reported MAEs and R² values. Without these details it is impossible to determine whether the ablation improvements are statistically reliable or sensitive to post-hoc partitioning choices.

Authors: We thank the referee for identifying this reporting gap. The original experiments employed an 80/10/10 stratified train/validation/test split (stratified by SAP and EI bands) and handled missing tabular values via median imputation for numeric features and mode imputation for categorical features. To improve transparency and statistical reliability, the revised §4 now details the split procedure, confirms the use of 5-fold cross-validation for all ablation studies, and reports mean ± standard deviation for all key metrics (e.g., SAP MAE 4.03 ± 0.15, R² 0.757 ± 0.012). These additions demonstrate that the multimodal performance gains remain consistent across folds.

revision: yes
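The split and imputation procedure described in this simulated response can be sketched as follows. This is an illustration of an 80/10/10 stratified split and median/mode imputation in general, not the authors' code; the record fields, the `stratified_split` and `impute` helpers, and the seed are hypothetical.

```python
import random
from collections import defaultdict
from statistics import median, mode


def stratified_split(records, key, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split records so each stratum keeps roughly the same proportions."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)

    train, val, test = [], [], []
    for members in groups.values():
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_val = round(fractions[1] * len(members))
        train.extend(members[:n_train])
        val.extend(members[n_train:n_train + n_val])
        test.extend(members[n_train + n_val:])  # remainder goes to test
    return train, val, test


def impute(values, numeric=True):
    """Fill None with the median (numeric) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]


# Toy records: 20 properties in SAP band C and 10 in band D.
records = [{"id": i, "sap_band": "C" if i < 20 else "D", "ei_band": "C"}
           for i in range(30)]
train, val, test = stratified_split(
    records, key=lambda r: (r["sap_band"], r["ei_band"]))
print(len(train), len(val), len(test))

print(impute([1.0, None, 3.0]))
print(impute(["gas", None, "gas", "electric"], numeric=False))
```

To avoid leaking information from the held-out sets, the median and mode would normally be computed on the training split only and then applied unchanged to validation and test.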

standing simulated objections not resolved

Empirical cross-regional or multi-city hold-out evaluation, as no additional comparable multimodal EPC datasets from other regions are currently available to the authors.

Circularity Check

0 steps flagged

No circularity: empirical ML predictions on held-out data

The paper trains a gated multimodal network on EPC tabular/text/spatial features and reports MAEs (4.03/4.76), R² values (~0.75), and ablation gains as direct empirical outputs on held-out Westminster data. No equations, self-definitional relations, or load-bearing self-citations reduce these scores to fitted constants or tautologies by construction. The architecture, auxiliary band head, and post-hoc interpretability (SHAP, occlusion, attribution) are defined independently of the target metrics; results remain falsifiable against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review based solely on abstract; no explicit free parameters, axioms, or invented entities are stated. The claim rests on standard neural network training assumptions and the representativeness of the Westminster EPC/GIS/text dataset.

axioms (1)

Domain assumption: the three data modalities (tabular EPC, assessor text, GIS spatial) are independent enough that their fusion yields additive gains.
Invoked by the ablation study design and gating mechanism.

pith-pipeline@v0.9.0 · 5627 in / 1440 out tokens · 59472 ms · 2026-05-08T16:19:27.004875+00:00 · methodology

Reference graph

Works this paper leans on

30 extracted references · 1 canonical work pages · 1 internal anchor

[1] Y. Chen, Z. Ren, Z. Peng, J. Yang, Z. Chen, Z. Deng, Impacts of climate change and building energy efficiency improvement on city-scale building energy consumption, Journal of Building Engineering 78 (2023) 107646
[2] X. Zhong, M. Hu, S. Deetman, B. Steubing, H. X. Lin, G. A. Hernandez, C. Harpprecht, C. Zhang, A. Tukker, P. Behrens, Global greenhouse gas emissions from residential and commercial building materials and mitigation strategies to 2060, Nature Communications 12 (1) (2021) 6126
[3] Y. Bai, C. Li, S. P. Jenne, S. Zhang, J. Wang, Occupant-centred thermal comfort space heating control via occupant position detection and multiphysics simulation, Building and Environment (2025) 113848
[4] K. Qu, X. Chen, A. Ekambaram, Y. Cui, G. Gan, A. Økland, S. Riffat, A novel holistic epc related retrofit approach for residential apartment building renovation in norway, Sustainable Cities and Society 54 (2020) 101975
[5] K. Amasyali, N. M. El-Gohary, A review of data-driven building energy consumption prediction studies, Renewable and Sustainable Energy Reviews 81 (2018) 1192–1205
[6] M. Beccali, G. Ciulla, V. L. Brano, A. Galatioto, M. Bonomolo, Artificial neural network decision support tool for assessment of the energy performance and the refurbishment actions for the non-residential building stock in southern italy, Energy 137 (2017) 1201–1218
[7] J. Chen, J. Bai, J. Xu, F. Farazi, S. Mosbach, J. Akroyd, M. Kraft, Transforming building retrofits: Linking energy, equity, and health insights from the world avatar, Advances in Applied Energy 19 (2025) 100230
[8] J. Few, D. Manouseli, E. McKenna, M. Pullinger, E. Zapata-Webborn, S. Elam, D. Shipworth, T. Oreszczyn, The over-prediction of energy use by epcs in great britain: A comparison of epc-modelled and metered primary energy use intensity, Energy and Buildings 288 (2023) 113024
[9] U. Ali, S. Bano, M. H. Shamsi, D. Sood, C. Hoare, W. Zuo, N. Hewitt, J. O’Donnell, Urban building energy performance prediction and retrofit analysis using data-driven machine learning approach, Energy and Buildings 303 (2024) 113768
[10] BRE Group, SAP 10.2: The Government’s Standard Assessment Procedure for Energy Rating of Dwellings, https://bregroup.com/documents/d/bre-group/sap-10-2-14-03-2025, accessed: 2026-02-01 (2025)
[11] U. Ali, M. H. Shamsi, M. Bohacek, C. Hoare, K. Purcell, E. Mangina, J. O’Donnell, A data-driven approach to optimize urban scale energy retrofit decisions for residential buildings, Applied Energy 267 (2020) 114861
[12] L. Wang, J.-j. Peng, J.-q. Wang, A multi-criteria decision-making framework for risk ranking of energy performance contracting project under picture fuzzy environment, Journal of cleaner production 191 (2018) 105–118
[13] GOV.UK, A guide to energy performance certificates for the marketing, sale and let of dwellings, https://assets.publishing.service.gov.uk/media/5a821a74ed915d74e3401be1/A_guide_to_energy_performance_certificates_for_the_marketing__sale_and_let_of_dwellings.pdf, accessed: 2026-02-02 (2017)
[14] A. Chari, S. Christodoulou, Building energy performance prediction using neural networks, Energy Efficiency 10 (5) (2017) 1315–1327
[15] Y. Liu, H. Chen, L. Zhang, Z. Feng, Enhancing building energy efficiency using a random forest model: A hybrid prediction approach, Energy Reports 7 (2021) 5003–5012
[16] S. Momeni, A. Eghbalian, M. Talebzadeh, A. Paksaz, S. K. Bakhtiarvand, S. Shahabi, Enhancing office building energy efficiency: neural network-based prediction of energy consumption, Journal of Building Pathology and Rehabilitation 9 (1) (2024) 68
[17] R. Olu-Ajayi, H. Alaka, I. Sulaimon, F. Sunmola, S. Ajayi, Building energy consumption prediction for residential buildings using deep learning and other machine learning techniques, Journal of Building Engineering 45 (2022) 103406
[18] Y. Sheng, H. Arbabi, W. O. Ward, M. A. Álvarez, M. Mayfield, City-scale residential energy consumption prediction with a multimodal approach, Scientific Reports 15 (1) (2025) 5313
[19] M. Sun, C. Han, Q. Nie, J. Xu, F. Zhang, Q. Zhao, Understanding building energy efficiency with administrative and emerging urban big data by deep learning in glasgow, Energy and Buildings 273 (2022) 112331
[20] Y. Sheng, W. O. Ward, H. Arbabi, M. Álvarez, M. Mayfield, Deep multimodal learning for residential building energy prediction, in: IOP conference series: earth and environmental science, Vol. 1078, IOP Publishing, 2022, p. 012038
[21] Y. Sheng, H. Arbabi, W. O. Ward, M. Mayfield, Learning from other cities: Transfer learning based multimodal residential energy prediction for cities with limited existing data, Energy and Buildings 338 (2025) 115723
[22] S. G. K. Uyar, B. K. Ozbay, B. Dal, Interpretable building energy performance prediction using xgboost quantile regression, Energy and Buildings (2025) 115815
[23] Y. Shen, Y. Pan, Bim-supported automatic energy performance analysis for green building design using explainable machine learning and multi-objective optimization, Applied Energy 333 (2023) 120575
[24] X. Li, Z. Han, G. Liu, A multimodal generative adversarial nets model for the prediction of matrix-based building performance, in: Building Simulation 2023, Vol. 18, IBPSA, 2023, pp. 1795–1802
[25] J. Lu, Y. Wen, et al., Multi-indicator performance prediction in residential buildings: A multimodal fusion method based on cross-attention, Building and Environment (2026) 114603
[26] S. Moveh, E. A. Merchán-Cruz, M. Abuhussain, S. Alhumaid, K. Almazam, Y. A. Dodo, Multi-building energy forecasting through weather-integrated temporal graph neural networks, Buildings 15 (5) (2025) 808
[27] Department for Levelling Up, Housing and Communities, Energy Performance of Buildings Data England and Wales, https://epc.opendatacommunities.org/domestic/search, accessed: 2026-02-02 (2026)
[28] Ordnance Survey, OS MasterMap Topography Layer, https://www.ordnancesurvey.co.uk/products/os-mastermap-topography-layer, accessed: 2026-02-03 (2026)
[29] D. P. Kingma, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014)
[30] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, Advances in neural information processing systems 30 (2017)