arxiv: 2604.21087 · v2 · submitted 2026-04-22 · 📊 stat.AP

Recognition: unknown

Model quality in football: Quantifying the quality of an Expected Threat model

Geurt Jongbloed, Jakob S\"ohl, Koen van Arem, Mirjam Bruinsma

Pith reviewed 2026-05-09 22:07 UTC · model grok-4.3

classification 📊 stat.AP

keywords expected threatmodel errorMarkov chainfootball analyticsscoutinglog-normal distributionmodel validationplayer evaluation

0 comments

The pith

The Expected Threat model error is approximately log-normally distributed for a given number of training points and game states, providing thresholds for when player evaluations become unreliable in scouting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how error in the Expected Threat model grows or shrinks with the number of game states and training data points. It uses the model's underlying Markov chain to run theoretical calculations and simulations that reveal the error distribution shape. The authors combine those results with input from experts to identify the error level at which player ratings lose reliability for scouting decisions. From this they extract practical rules of thumb for building and checking an xT model before use. The same approach is noted to apply to broader Expected Possession Value models.

Core claim

Using the Markov chain that underlies the Expected Threat model, theoretical analyses and simulations demonstrate that model error is approximately log-normally distributed once the number of training points and game states are fixed. These simulations, paired with expert consultation, establish the error magnitude beyond which player evaluations derived from the model become unreliable for scouting applications, from which the authors derive rules of thumb for ensuring model quality prior to deployment.

What carries the argument

The Markov chain representation of football game states, which enables both theoretical derivation and simulation of the model's error distribution.

If this is right

Model error follows an approximately log-normal distribution once training points and game states are fixed.
There exists an identifiable error threshold past which Expected Threat-based player evaluations are unreliable for scouting.
Rules of thumb can be applied to check model quality before practical use.
The same quantification framework extends directly to Expected Possession Value models.
A validated model can be used to generate reliable player evaluations in scouting workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams could adjust the number of game states or data points to stay below the unreliable threshold while keeping computational cost low.
The log-normal error shape may allow simple statistical tests to certify a new model before it is put into production.
Similar simulation-plus-expert protocols could be applied to other unobservable-ground-truth models in sports analytics.
Long-term monitoring of actual match outcomes against model predictions would provide an ongoing check on whether the error threshold remains stable.

Load-bearing premise

The Markov chain accurately captures the real dynamics of football play and expert judgment supplies a valid threshold for when model error makes evaluations unreliable.

What would settle it

A large out-of-sample validation set in which the observed distribution of Expected Threat model errors deviates substantially from log-normality or in which player ratings remain stable and useful well beyond the expert-derived error threshold.

read the original abstract

The recent growth in data availability in football has increased the risk of incorrect use of data-driven models, making guidelines on their validation and application necessary. The Expected Threat (xT) model is an accessible option for football organizations that start building in-house methods, yet little is known about how to assess its quality. The aim of this study is twofold: to examine how the model error depends on the number of game states and the number of training points, and to translate these results into guidelines for constructing and applying the model. Using the Markov chain underlying the model, we perform theoretical analyses and simulations to study the model error. These show that the model error is approximately log-normally distributed for a specified number of training points and game states. Additionally, we combine the simulations with expert consultation to establish the model error beyond which player evaluations based on the Expected Threat model become unreliable for scouting applications. From this, we derive rules of thumb to ensure the quality of an Expected Threat model before application, and we illustrate through an example how a validated model can be applied in practice. Because the approach generalizes to Expected Possession Value models, this paper illustrates a framework to systematically quantify model quality, despite the ground truth being unobservable in football analytics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a simulation-based way to bound sampling error in xT models and set expert thresholds, but those bounds stay inside the model's assumptions.

read the letter

The key thing to know is that this paper develops a framework for assessing the quality of Expected Threat models by analyzing their estimation error through simulations on the underlying Markov chain and then calibrating reliability thresholds with expert input. What stands out as new is the combination of showing that the error in xT values is approximately log-normally distributed for given numbers of training points and game states, along with the expert-derived cutoff for when player evaluations become unreliable in scouting contexts. They back this with theoretical analyses and Monte Carlo simulations, derive rules of thumb for model construction, and include a practical example of applying a validated model. The note that the approach can extend to Expected Possession Value models broadens its potential reach. The work does a good job of being explicit about the steps: they use the Markov chain formulation directly for the error study, which keeps things reproducible within that setup, and they avoid claiming they have ground truth by instead focusing on when the model error affects downstream decisions. The main limitation comes from the simulation design. Since the data for the simulations is generated from the fitted Markov chain itself, the resulting error distribution only measures sampling variability assuming the model structure is correct. It does not account for potential misspecification, such as the discretization of the pitch into zones, the omission of player-specific or contextual factors, or the non-Markovian nature of actual play sequences. As a result, the log-normal characterization and the thresholds for unreliability may not fully capture the discrepancies that arise when the model is used on real match data. The expert consultation adds a practical layer but remains somewhat subjective and tied to the simulated scenarios. Overall, this is a paper for football analysts and data teams at clubs who are implementing or reviewing xT models and need guidance on minimum standards for data volume and state granularity. It provides a starting point for systematic quality checks in an area where direct validation is hard. I would recommend sending it for peer review. The methodological contribution is clear and the paper is honest about working without observable ground truth, though revisions could usefully address how the framework performs against external benchmarks.

Referee Report

2 major / 2 minor

Summary. The manuscript examines the quality of Expected Threat (xT) models in football by leveraging the underlying Markov chain formulation. Through theoretical analyses and Monte Carlo simulations, it establishes that estimation error in xT values is approximately log-normally distributed for fixed numbers of training points and game states. The authors combine simulation results with expert consultation to identify a model-error threshold beyond which xT-based player evaluations become unreliable for scouting, from which they derive practical rules of thumb for model construction and illustrate an application example. The framework is presented as generalizable to Expected Possession Value models, providing a systematic approach to model validation where ground truth is unobservable.

Significance. If the internal error characterization and expert-calibrated threshold hold under the stated assumptions, the work supplies a concrete, simulation-driven framework for quantifying xT model quality that could help organizations avoid over-reliance on under-specified models in scouting and tactical analysis. The explicit use of the Markov chain for both theoretical derivations and controlled simulations is a methodological strength that allows precise statements about finite-sample behavior.

major comments (2)

[simulation methodology and results] The Monte Carlo simulation design (described in the methods section on simulations and results) samples transition counts directly from the fitted Markov chain, thereby quantifying only sampling variance under correct specification. This construction supports the log-normal error claim within the model but does not inject continuous pitch locations, player-specific effects, or non-Markovian history that characterize real match data. Because the derived reliability threshold and rules of thumb are intended for scouting applications on actual data, the absence of misspecification analysis is load-bearing for the central claim that the guidelines ensure model quality in practice.
[expert consultation and threshold derivation] The expert consultation used to set the numerical threshold for unreliable player evaluations (abstract and the section combining simulations with expert input) is presented without details on the number or expertise of participants, the precise elicitation protocol, or sensitivity of the threshold to alternative values. This threshold directly determines the rules of thumb, so lack of transparency and robustness checks weakens the translation from simulation results to actionable guidelines.

minor comments (2)

[abstract and conclusion] The abstract states that the approach 'generalizes to Expected Possession Value models' but provides no explicit demonstration or discussion of the required modifications; a short paragraph or appendix illustrating the extension would strengthen the claim.
[notation and methods] Notation for the number of game states and training points is introduced without a consolidated table of symbols; adding such a table would improve readability when the log-normal parameters are later referenced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. Below we provide point-by-point responses to the major comments. We agree with the need for greater transparency and will revise the manuscript accordingly to strengthen the presentation of our methods and results.

read point-by-point responses

Referee: [simulation methodology and results] The Monte Carlo simulation design (described in the methods section on simulations and results) samples transition counts directly from the fitted Markov chain, thereby quantifying only sampling variance under correct specification. This construction supports the log-normal error claim within the model but does not inject continuous pitch locations, player-specific effects, or non-Markovian history that characterize real match data. Because the derived reliability threshold and rules of thumb are intended for scouting applications on actual data, the absence of misspecification analysis is load-bearing for the central claim that the guidelines ensure model quality in practice.

Authors: Our simulations are specifically constructed to analyze the finite-sample behavior of the xT estimator under the Markov chain model assumptions, which enables the theoretical derivation of the approximate log-normal distribution of the error. This approach isolates the effect of the number of training points and game states on estimation error, providing a controlled environment to establish baseline reliability. We recognize that real-world football data may include additional complexities such as continuous spatial effects and non-Markovian dependencies not captured in the discrete state model. The rules of thumb are therefore presented as necessary but not sufficient conditions for model quality in practice, and we suggest they be used in conjunction with out-of-sample validation on real data. In the revised version, we will include an expanded discussion of these limitations and potential extensions to account for misspecification. revision: yes
Referee: [expert consultation and threshold derivation] The expert consultation used to set the numerical threshold for unreliable player evaluations (abstract and the section combining simulations with expert input) is presented without details on the number or expertise of participants, the precise elicitation protocol, or sensitivity of the threshold to alternative values. This threshold directly determines the rules of thumb, so lack of transparency and robustness checks weakens the translation from simulation results to actionable guidelines.

Authors: We acknowledge the importance of providing full details on the expert consultation to allow for proper evaluation of the threshold's robustness. The revised manuscript will include a more detailed description of the consultation process, specifying the number of experts involved, their relevant expertise in football analytics and scouting, the structured elicitation protocol employed, and results from sensitivity analyses varying the threshold value to assess impact on the derived rules of thumb. revision: yes

Circularity Check

0 steps flagged

No significant circularity: estimator error analysis is self-contained under the assumed model.

full rationale

The paper's central results derive from theoretical analysis and Monte Carlo simulation of the sampling distribution of the xT estimator when data are generated from the fitted Markov chain itself. This is a standard statistical procedure for characterizing finite-sample properties and does not reduce the reported log-normal error distribution or the expert-calibrated reliability threshold to any definitional equivalence, fitted-input renaming, or self-citation chain. The Markov chain representation and expert consultation are treated as external inputs rather than outputs of the same fitting step. No load-bearing self-citation, ansatz smuggling, or uniqueness theorem imported from prior author work is required for the claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available; the central claim rests on the Markov chain model of football and the validity of expert judgment for error thresholds. No free parameters or invented entities are identifiable from the abstract.

axioms (1)

domain assumption Football can be represented as a Markov chain whose states capture the relevant game situations for threat calculation.
Invoked as the foundation for both theoretical error analysis and simulations.

pith-pipeline@v0.9.0 · 5528 in / 1343 out tokens · 49053 ms · 2026-05-09T22:07:25.246924+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

27 extracted references · 16 canonical work pages

[1]

Science and Medicine in Football10(1), 80–95 (2025) https://doi.org/10.1080/24733938 .2025.2476478

Dello Iacono, A., Datson, N., Clubb, J., Lacome, M., Sullivan, A., Shushan, T.: Data analytics practices and reporting strategies in senior football: insights into athlete health and performance from over 200 practitioners worldwide. Science and Medicine in Football10(1), 80–95 (2025) https://doi.org/10.1080/24733938 .2025.2476478

work page doi:10.1080/24733938 2025
[2]

In: Raval, M.S., Kaya, T., Artan, N.S., Taber, C

Kholkine, L.: Opportunities and challenges of machine learning in sports. In: Raval, M.S., Kaya, T., Artan, N.S., Taber, C. (eds.) Sports Data Analytics: Tech- niques, Applications, and Innovations, pp. 243–260. Springer, Singapore (2026). https://doi.org/10.1007/978-981-95-5132-3 13

work page doi:10.1007/978-981-95-5132-3 2026
[3]

Frontiers in Sports and Active Living7(2025) https://doi.or g/10.3389/fspor.2025.1569155

Teixeira, J.E., Maio, E., Afonso, P., Encarna¸ c˜ ao, S., Machado, G.F., Morgans, R., Barbosa, T.M., Monteiro, A.M., Forte, P., Ferraz, R., Branquinho, L.: Mapping football tactical behavior and collective dynamics with artificial intelligence: a systematic review. Frontiers in Sports and Active Living7(2025) https://doi.or g/10.3389/fspor.2025.1569155

work page doi:10.3389/fspor.2025.1569155 2025
[4]

Biology of Sport40(1), 249– 263 (2023) https://doi.org/10.5114/biolsport.2023.112970 24

Rico-Gonz´ alez, M., Pino-Ortega, J., M´ endez, A., Clemente, F., Baca, A.: Machine learning application in soccer: a systematic review. Biology of Sport40(1), 249– 263 (2023) https://doi.org/10.5114/biolsport.2023.112970 24

work page doi:10.5114/biolsport.2023.112970 2023
[5]

Science and Medicine in Football, 1–13 (2025) https://doi.org/10.1080/24733938.2025.2533784

Olthof, S., Davis, J.: Perspectives on data analytics for gaining a competitive advantage in football: computational approaches to tactics. Science and Medicine in Football, 1–13 (2025) https://doi.org/10.1080/24733938.2025.2533784

work page doi:10.1080/24733938.2025.2533784 2025
[6]

In: Proceedings of the 7th Workshop on Machine Learning and Data Mining for Sports Analytics, pp

Robberechts, P., Davis, J.: How data availability affects the ability to learn good xG models. In: Proceedings of the 7th Workshop on Machine Learning and Data Mining for Sports Analytics, pp. 17–27. Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-64912-8 2

work page doi:10.1007/978-3-030-64912-8 2020
[7]

Frontiers in Sports and Active Living3 - 2021(2021) https://doi.org/10.3389/fspor.2021.624475

Anzer, G., Bauer, P.: A goal scoring probability model for shots based on syn- chronized positional and event data in football (soccer). Frontiers in Sports and Active Living3 - 2021(2021) https://doi.org/10.3389/fspor.2021.624475

work page doi:10.3389/fspor.2021.624475 2021
[8]

PLOS ONE18(4), 0282295 (2023) https: //doi.org/10.1371/journal.pone.0282295

Mead, J., O’Hare, A., McMenemy, P.: Expected goals in football: Improving model performance and demonstrating value. PLOS ONE18(4), 0282295 (2023) https: //doi.org/10.1371/journal.pone.0282295

work page doi:10.1371/journal.pone.0282295 2023
[9]

Presented at the 2011 New Eng- land Symposium on Statistics in Sports, Harvard University, Cambridge, MA, 24 September 2011 (2011)

Rudd, S.: A Framework for Tactical Analysis and Individual Offensive Production Assessment in Soccer Using Markov Chains. Presented at the 2011 New Eng- land Symposium on Statistics in Sports, Harvard University, Cambridge, MA, 24 September 2011 (2011)

2011
[10]

In: Proceedings of the AAAI 2020 Workshop on Artificial Intelligence in Team Sports

Van Roy, M., Robberechts, P., Decroos, T., Davis, J.: Valuing on-the-ball actions in soccer: A critical comparison of xt and vaep. In: Proceedings of the AAAI 2020 Workshop on Artificial Intelligence in Team Sports. AAAI Press, New York, USA (2020). https://tomdecroos.github.io/reports/xt vs vaep.pdf

2020
[11]

https://karun.in/blog/expected-t hreat.html

Singh, K.: Introducing Expected Threat (xT). https://karun.in/blog/expected-t hreat.html. Accessed: 5-11-2024 (2018)

2024
[12]

SpringerPlus5(1) (2016) https: //doi.org/10.1186/s40064-016-3108-2

Rein, R., Memmert, D.: Big data and tactical analysis in elite soccer: future challenges and opportunities for sports science. SpringerPlus5(1) (2016) https: //doi.org/10.1186/s40064-016-3108-2

work page doi:10.1186/s40064-016-3108-2 2016
[13]

International Journal of Sports Science & Coaching 14(6), 798–817 (2019) https://doi.org/10.1177/1747954119879350

Herold, M., Goes, F., Nopp, S., Bauer, P., Thompson, C., Meyer, T.: Machine learning in men’s professional football: Current applications and future directions for improving attacking play. International Journal of Sports Science & Coaching 14(6), 798–817 (2019) https://doi.org/10.1177/1747954119879350

work page doi:10.1177/1747954119879350 2019
[14]

PhD thesis, KU Leuven (2025)

Bransen, L.: Beyond the scoreline: Using machine learning techniques to under- stand (women’s) soccer. PhD thesis, KU Leuven (2025)

2025
[15]

Paper presented at the 15th International Conference on the Engineering of Sport (ISEA 2024), Lough- borough University, Loughborough, 8–11 July 2024 (2024)

Van Arem, K.W., Bruinsma, M.: Extended xThreat: an explainable quality assess- ment method for actions in football using game context. Paper presented at the 15th International Conference on the Engineering of Sport (ISEA 2024), Lough- borough University, Loughborough, 8–11 July 2024 (2024). https://doi.org/10.1 7028/RD.LBORO.27045427.V1 25

2024
[16]

Applied Sciences 15(8) (2025) https://doi.org/10.3390/app15084151

Hassani, K., Ramdani, M., Lotfi, M.: Dynamic expected threat (dxt) model: Addressing the deficit of realism in football action evaluation. Applied Sciences 15(8) (2025) https://doi.org/10.3390/app15084151

work page doi:10.3390/app15084151 2025
[17]

Paper presented at the 13th MIT Sloan Sports Analytics Conference, Boston, MA, 1-2 March 2019 (2019)

Fern´ andez, J., Bornn, L., Cervone, D.: Decomposing the Immeasurable Sport: A Deep Learning Expected Possession Value Framework for Soccer. Paper presented at the 13th MIT Sloan Sports Analytics Conference, Boston, MA, 1-2 March 2019 (2019). https://www.sloansportsconference.com/research-papers/decomposing-t he-immeasurable-sport-a-deep-learning-expected...

2019
[18]

Machine Learning 110(6), 1389–1427 (2021) https://doi.org/10.1007/s10994-021-05989-6

Fern´ andez, J., Bornn, L., Cervone, D.: A framework for the fine-grained evalua- tion of the instantaneous expected value of soccer possessions. Machine Learning 110(6), 1389–1427 (2021) https://doi.org/10.1007/s10994-021-05989-6

work page doi:10.1007/s10994-021-05989-6 2021
[19]

Paper presented at the 15th MIT Sloan Sports Analytics Conference, Boston, MA, 8-9 April 2021 (2021)

St¨ ockl, M., Seidl, T., Marley, D., Power, P.: Making Offensive Play Predictable: Using a Graph Convolutional Network to Understand Defensive Performance in Soccer. Paper presented at the 15th MIT Sloan Sports Analytics Conference, Boston, MA, 8-9 April 2021 (2021). https://www.sloansportsconference.com/re search-papers/making-offensive-play-predictable-...

2021
[20]

Paper presented at the 13th International Conference on Sport Sciences Research and Technology Support (icSPORTS), Marbella, Spain, 21-22 October 2025 (2025)

Overmeer, T., Janssen, T., Nuijten, W.: Revisiting Expected Possession Value in Football: Introducing a U-Net architecture, reward and risk for passes, and a benchmark. Paper presented at the 13th International Conference on Sport Sciences Research and Technology Support (icSPORTS), Marbella, Spain, 21-22 October 2025 (2025). https://doi.org/10.5220/00137...

work page doi:10.5220/0013784300003988 2025
[21]

Jour- nal of Artificial Intelligence Research77, 517–562 (2023) https://doi.org/10.161 3/jair.1.13934

Van Roy, M., Robberechts, P., Yang, W.-C., De Raedt, L., Davis, J.: A markov framework for learning and reasoning about strategies in professional soccer. Jour- nal of Artificial Intelligence Research77, 517–562 (2023) https://doi.org/10.161 3/jair.1.13934

2023
[22]

The CMS experiment at the CERN LHC

Pulis, M., Bajada, J.: Reinforcement Learning for Football Player Decision Making Analysis. Paper presented at the Statsbomb Conference, London, 20 September 2022 (2022). https://www.um.edu.mt/library/oar/handle/123456789 /131785

work page arXiv 2022
[23]

International Journal of Sports Science & Coaching 19(1), 230–244 (2024) https://doi.org/10.1177/17479541231154494 https://doi.org/10.1177/17479541231154494

Rahimian, P., Van Haaren, J., Toka, L.: Towards maximizing expected pos- session outcome in soccer. International Journal of Sports Science & Coaching 19(1), 230–244 (2024) https://doi.org/10.1177/17479541231154494 https://doi.org/10.1177/17479541231154494

work page doi:10.1177/17479541231154494 2024
[24]

Paper presented at the MIT Sloan Sports Analytics Conference, 6–7 March 2026 (2026)

Kim, H., Seo, S., Choi, H., Boomstra, T., Yoon, J., Park, C.: Better Prevent than Tackle: valuing defense in soccer based on graph neural networks. Paper presented at the MIT Sloan Sports Analytics Conference, 6–7 March 2026 (2026). https://www.sloansportsconference.com/research-papers/better-prevent-than-t 26 ackle-valuing-defense-in-soccer-based-on-grap...

2026
[25]

https://github.com/statsbomb/open-data

StatsBomb: Open Data. https://github.com/statsbomb/open-data. Accessed: 2024-10-15 (2024)

2024
[26]

Machine Learning113, 6977–7010 (2024) https://doi.org/10.1007/s10994-024-06585-0

Davis, J., Bransen, L., Devos, L.,et al.: Methodology and evaluation in sports analytics: Challenges, approaches, and lessons learned. Machine Learning113, 6977–7010 (2024) https://doi.org/10.1007/s10994-024-06585-0

work page doi:10.1007/s10994-024-06585-0 2024
[27]

https://www.tudelft.nl/dhpc/ark: /44463/DelftBluePhase2 (2024) 27

(DHPC): DelftBlue Supercomputer (Phase 2). https://www.tudelft.nl/dhpc/ark: /44463/DelftBluePhase2 (2024) 27

2024