Revealing Geography-Driven Signals in Zone-Level Claim Frequency Models: An Empirical Study using Environmental and Visual Predictors
Pith reviewed 2026-05-08 14:08 UTC · model grok-4.3
The pith
Geographic features from maps and imagery improve accuracy in zone-level motor insurance claim frequency models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Geographic information constructed from OpenStreetMap indicators, CORINE Land Cover, and Belgian orthoimagery augments standard actuarial variables to raise predictive accuracy in zone-level Motor Third Party Liability claim frequency models. Linear and tree-based models both improve, with the strongest results from latitude-longitude paired with environmental features at the 5 km scale; smaller neighborhoods still help baselines. Image embeddings add value mainly when environmental features are unavailable, and overall performance hinges more on how geography is represented than on model complexity.
What carries the argument
Zone-level aggregation of claims paired with constructed geographic predictors—coordinates, scale-specific environmental features, and pretrained vision-transformer embeddings—added to GLM, regularized GLM, and gradient-boosted tree baselines.
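As a rough illustration of what a scale-specific environmental feature might look like, the sketch below counts points of interest within a fixed radius of a zone centroid. It is a minimal sketch under stated assumptions, not the paper's feature pipeline: the function names, the use of a centroid with a great-circle distance, and the example coordinates are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def count_within_radius(centroid, points, radius_km=5.0):
    """Count points (lat, lon) lying within radius_km of a zone centroid.

    Hypothetical helper: the radius plays the role of the neighborhood scale.
    """
    lat0, lon0 = centroid
    return sum(
        1 for lat, lon in points
        if haversine_km(lat0, lon0, lat, lon) <= radius_km
    )


# Illustrative coordinates only (roughly Brussels and Antwerp).
zone_centroid = (50.85, 4.35)
points_of_interest = [(50.86, 4.36), (50.85, 4.35), (51.22, 4.40)]
n_nearby = count_within_radius(zone_centroid, points_of_interest, radius_km=5.0)
```

Changing `radius_km` is the only thing that distinguishes one neighborhood scale from another in this sketch, which is why the scale behaves like a tunable parameter of the feature set.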
If this is right
- Coordinates combined with 5 km environmental features deliver the largest accuracy lift for both linear and tree-based models.
- Environmental features at smaller neighborhood scales still improve baseline specifications.
- Pretrained image embeddings raise accuracy and stability for regularized GLMs only when environmental features are absent.
- The predictive contribution of geography depends less on model type than on the chosen representation of location.
Where Pith is reading between the lines
- The same open-data approach could be tested in other insurance lines where location influences risk but detailed addresses are restricted.
- Moving from zone aggregates to policy-level data might show whether the geographic signals strengthen or weaken without averaging.
- Widespread use of these public sources could lower dependence on proprietary location datasets for actuarial work.
Load-bearing premise
The observed accuracy gains truly reflect location-based risk differences rather than dataset-specific correlations or the particular feature construction choices.
What would settle it
Repeating the same feature additions on an independent insurance dataset from another country or time period and finding no gain or a loss in held-out predictive metrics.
Original abstract
Geographic context is often consider relevant to motor insurance risk, yet public actuarial datasets provide limited location identifiers, constraining how this information can be incorporated and evaluated in claim-frequency models. This study examines how geographic information from alternative data sources can be incorporated into actuarial models for Motor Third Party Liability (MTPL) claim prediction under such constraints. Using the BeMTPL97 dataset, we adopt a zone-level modeling framework and evaluate predictive performance on unseen postcodes. Geographic information is introduced through two channels: environmental indicators from OpenStreetMap and CORINE Land Cover, and orthoimagery released by the Belgian National Geographic Institute for academic use. We evaluate the predictive contribution of coordinates, environmental features, and image embeddings across three baseline models: generalized linear models (GLMs), regularized GLMs, and gradient-boosted trees, while raw imagery is modeled using convolutional neural networks. Our results show that augmenting actuarial variables with constructed geographic information improves accuracy. Across experiments, both linear and tree-based models benefit most from combining coordinates with environmental features extracted at 5 km scale, while smaller neighborhoods also improve baseline specifications. Generally, image embeddings do not improve performance when environmental features are available; however, when such features are absent, pretrained vision-transformer embeddings enhance accuracy and stability for regularized GLMs. Our results show that the predictive value of geographic information in zone-level MTPL frequency models depends less on model complexity than on how geography is represented, and illustrate that geographic context can be incorporated despite limited individual-level spatial information.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper investigates incorporating geographic information from public sources (OpenStreetMap, CORINE Land Cover, Belgian orthoimagery) into zone-level MTPL claim-frequency models on the BeMTPL97 dataset. It evaluates GLMs, regularized GLMs, and gradient-boosted trees on unseen postcodes, claiming that augmenting actuarial baselines with coordinates plus environmental features (especially at 5 km scale) improves accuracy, while image embeddings help mainly when environmental features are absent; the predictive value depends more on geography representation than model complexity.
Significance. If the reported gains hold after addressing selection and leakage concerns, the work would provide actionable evidence that alternative geographic data can enhance actuarial models when individual-level location identifiers are limited. It would illustrate practical trade-offs between feature construction and model class, with potential to inform risk pricing in motor insurance using publicly available spatial layers.
major comments (2)
- [Abstract] Abstract: the central claim that 'both linear and tree-based models benefit most from combining coordinates with environmental features extracted at 5 km scale' is presented without any quantitative metrics (e.g., change in Poisson deviance, log-loss, or AUC), confidence intervals, or a table of results across all tested scales; this omission makes it impossible to judge the magnitude or robustness of the improvement that underpins the paper's main contribution.
- [Evaluation methodology] Evaluation methodology (implied in abstract and results description): hold-out on 'unseen postcodes' is not described as spatially blocked or geographically stratified; because environmental features are extracted from fixed external maps at fixed radii, random postcode splits permit spatial autocorrelation leakage between train and test sets, which directly threatens the claim that observed gains reflect genuine location-based risk signals rather than correlated predictors.
minor comments (2)
- [Abstract] Abstract: grammatical error ('is often consider relevant' should be 'is often considered relevant').
- [Abstract] Abstract: the statement 'smaller neighborhoods also improve baseline specifications' is imprecise; it should specify the radii tested and report the corresponding performance deltas to allow readers to assess the scale-sensitivity claim.
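The performance deltas requested in these comments could be reported along the following lines. This is a minimal sketch, assuming unweighted zone-level counts and strictly positive predicted means; the function names are hypothetical, and exposure weighting, which a zone-level frequency model would normally need, is left out for brevity.

```python
import math


def mean_poisson_deviance(y_true, y_pred):
    """Mean Poisson deviance between observed counts and predicted means.

    Uses 2 * (y * log(y / mu) - (y - mu)), with the log term taken as 0 when y == 0.
    """
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        if mu <= 0:
            raise ValueError("predicted means must be strictly positive")
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (log_term - (y - mu))
    return total / len(y_true)


def relative_reduction(baseline, candidate):
    """Fractional reduction of the candidate's deviance relative to the baseline's."""
    return (baseline - candidate) / baseline
```

A positive `relative_reduction` means the candidate feature set has lower held-out deviance than the baseline; reporting it per radius would make the scale-sensitivity claim checkable.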
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.
-
Referee: [Abstract] Abstract: the central claim that 'both linear and tree-based models benefit most from combining coordinates with environmental features extracted at 5 km scale' is presented without any quantitative metrics (e.g., change in Poisson deviance, log-loss, or AUC), confidence intervals, or a table of results across all tested scales; this omission makes it impossible to judge the magnitude or robustness of the improvement that underpins the paper's main contribution.
Authors: We agree that the abstract would be strengthened by including quantitative indicators of the reported improvements. The results section of the manuscript already contains tables and figures with performance metrics (Poisson deviance, log-loss) across models, feature sets, and spatial scales. In the revised version we will update the abstract to report the key numerical gains (e.g., relative reduction in Poisson deviance for the best coordinate-plus-5 km environmental configuration versus the actuarial baseline) and will explicitly reference the corresponding results table.
revision: yes
-
Referee: [Evaluation methodology] Evaluation methodology (implied in abstract and results description): hold-out on 'unseen postcodes' is not described as spatially blocked or geographically stratified; because environmental features are extracted from fixed external maps at fixed radii, random postcode splits permit spatial autocorrelation leakage between train and test sets, which directly threatens the claim that observed gains reflect genuine location-based risk signals rather than correlated predictors.
Authors: We acknowledge the validity of this concern. The current evaluation splits postcodes into train and test sets to evaluate performance on unseen locations, but a purely random postcode split does not explicitly enforce spatial separation. Because environmental features are derived from fixed-radius buffers, nearby postcodes can share highly correlated predictors, raising the possibility of leakage. In the revision we will replace the simple hold-out with a spatially blocked or geographically stratified procedure (e.g., blocking by larger administrative units or using a distance-based split) and will report the updated performance metrics together with a discussion of how this change affects the interpretation of the geographic signals.
revision: yes
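One simple way to picture a spatially blocked split is to assign each postcode centroid to a grid cell and hold out whole cells. The sketch below is an assumption-laden illustration, not the procedure the authors commit to: the grid in degrees, the cell size, the function names, and the example centroids are all hypothetical.

```python
import math
import random


def block_id(lat, lon, cell_deg=0.25):
    """Map a centroid to the grid cell (block) that contains it."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def spatial_block_split(zones, test_fraction=0.2, cell_deg=0.25, seed=0):
    """Split postcodes so that every grid block falls entirely in train or in test.

    zones: dict mapping postcode -> (lat, lon) centroid.
    Returns (train_postcodes, test_postcodes) as sorted lists.
    """
    blocks = {}
    for code, (lat, lon) in zones.items():
        blocks.setdefault(block_id(lat, lon, cell_deg), []).append(code)

    block_keys = sorted(blocks)
    random.Random(seed).shuffle(block_keys)  # seeded for reproducibility
    n_test = max(1, round(test_fraction * len(block_keys)))
    test_blocks = set(block_keys[:n_test])

    train, test = [], []
    for key, codes in blocks.items():
        (test if key in test_blocks else train).extend(codes)
    return sorted(train), sorted(test)


# Illustrative centroids only; postcodes "1000" and "1050" share one block.
zones = {
    "1000": (50.85, 4.35), "1050": (50.83, 4.37), "2000": (51.22, 4.40),
    "9000": (51.05, 3.72), "4000": (50.63, 5.57), "8000": (51.21, 3.22),
}
train_codes, test_codes = spatial_block_split(zones, test_fraction=0.2)
```

Blocking of this kind keeps neighbouring postcodes on the same side of the split, but adjacent blocks still touch, so fixed-radius buffers near a block edge can overlap across train and test; a distance-based gap between held-out and training blocks would be a stricter variant.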
Circularity Check
No significant circularity in empirical geographic feature augmentation
The paper is a standard empirical ML study on the BeMTPL97 dataset. It augments zone-level claim frequency models with coordinates, environmental features from public sources (OpenStreetMap, CORINE), and image embeddings, then reports predictive performance on held-out postcodes using GLMs, regularized GLMs, and gradient-boosted trees. No mathematical derivation chain exists that reduces predictions or results to inputs by construction. No self-citations are load-bearing, no fitted parameters are relabeled as independent predictions, and no ansatzes or uniqueness theorems are invoked. The reported accuracy gains are data-driven outcomes evaluated against external benchmarks, making the analysis self-contained.
Axiom & Free-Parameter Ledger
free parameters (2)
- environmental feature extraction scale (5 km)
- model hyperparameters (regularization, tree parameters)
axioms (2)
- domain assumption: Zone-level aggregation preserves predictive signals without introducing bias that geographic features merely compensate for
- domain assumption: Environmental and visual features from public sources are causally or predictively relevant to claim frequency