New insights into Elo algorithm for practitioners and statisticians
Pith reviewed 2026-05-13 16:56 UTC · model grok-4.3
The pith
Elo's heuristic and statistical views align exactly only for logistic expected scores, but estimation noise requires decoupling the ranking model from the prediction model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Both the practitioner's heuristic feedback rule and the statistician's online maximum likelihood estimation via stochastic gradient ascent coincide exactly in the binary case if and only if the expected score is the logistic function. Estimation noise forces a principled decoupling between the model used for ranking and the model used for prediction: the effective scale and home-field advantage parameter must be adjusted to account for the noise, with closed-form corrections and a data-driven identification procedure provided. For multilevel outcomes an exact relationship holds when outcome scores are uniformly spaced, but noise-aware approximations are preferred in general because they fit the data better.
What carries the argument
The noise-induced decoupling between the ranking model and the prediction model, implemented through closed-form adjustments to the scale parameter and home-field advantage.
If this is right
- The decoupled approach yields substantially better predictions than reusing the ranking model directly for prediction.
- The adjustment procedure acts as a diagnostic that reveals whether rating estimates have converged.
- Closed-form corrections are available for binary outcomes while approximations handle general multilevel scores.
- Application to FIFA data indicates that the ranking process had not converged for the vast majority of national teams.
Where Pith is reading between the lines
- Similar decoupling may improve predictive accuracy in other heuristic rating systems that rely on online gradient-style updates.
- The data-driven identification procedure could be applied routinely by practitioners to tune parameters on their own competition data.
- The convergence diagnostic might be used to decide when to stop updating ratings in ongoing tournaments or leagues.
Load-bearing premise
That the effects of estimation noise can be accurately captured and corrected by the derived closed-form adjustments without introducing new biases.
What would settle it
If the decoupled model's out-of-sample prediction accuracy on held-out match data is no better than the conventional model's, or if the adjusted scale and home-field values differ substantially from those identified directly from the same data, the need for decoupling would be challenged.
Original abstract
This work reconciles two perspectives on the Elo ranking that coexist in the literature: the practitioner's view as a heuristic feedback rule, and the statistician's view as online maximum likelihood estimation via stochastic gradient ascent. Both perspectives coincide exactly in the binary case (iff the expected score is the logistic function). However, estimation noise forces a principled decoupling between the model used for ranking and the model used for prediction: the effective scale and home-field advantage parameter must be adjusted to account for the noise. We provide both closed-form corrections and a data-driven identification procedure. For multilevel outcomes, an exact relationship exists when outcome scores are uniformly spaced, but approximations are preferred in general: they account for estimation noise and better fit the data. The decoupled approach substantially outperforms the conventional one that reuses the ranking model for prediction, and serves as a diagnostic of convergence status. Applied to six years of FIFA men's ranking, we find that the ranking had not converged for the vast majority of national teams. The paper is written in a semi-tutorial style accessible to practitioners, with all key results accompanied by closed-form expressions and numerical examples.
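The binary-case coincidence stated in the abstract can be checked numerically. The sketch below is illustrative and not taken from the paper: it assumes the common base-10 logistic expected score, and the function names, the ratings 1500 and 1600, `K = 20`, `scale = 400` and `hfa = 50` are all hypothetical choices. Under that assumption, the gradient of the Bernoulli log-likelihood with respect to the home rating is `(ln 10 / scale) * (y - expected)`, so one stochastic gradient ascent step with step size `K * scale / ln 10` should reproduce the heuristic Elo update.

```python
import math

def expected_score(r_home, r_away, scale=400.0, hfa=0.0):
    """Base-10 logistic expected score of the home side (assumed form)."""
    z = (r_home + hfa - r_away) / scale
    return 1.0 / (1.0 + 10.0 ** (-z))

def elo_update(r_home, r_away, y, K=20.0, scale=400.0, hfa=0.0):
    """Practitioner's view: move ratings by K times the surprise (y - expected)."""
    delta = K * (y - expected_score(r_home, r_away, scale, hfa))
    return r_home + delta, r_away - delta

def log_likelihood(r_home, r_away, y, scale, hfa):
    """Bernoulli log-likelihood of a binary outcome y in {0, 1}."""
    p = expected_score(r_home, r_away, scale, hfa)
    return y * math.log(p) + (1 - y) * math.log(1.0 - p)

def sga_update(r_home, r_away, y, step, scale, hfa, eps=1e-3):
    """Statistician's view: one gradient ascent step on the log-likelihood.

    The gradient is taken by central differences so the check does not
    rely on the analytic derivative it is meant to confirm.
    """
    g_home = (log_likelihood(r_home + eps, r_away, y, scale, hfa)
              - log_likelihood(r_home - eps, r_away, y, scale, hfa)) / (2 * eps)
    g_away = (log_likelihood(r_home, r_away + eps, y, scale, hfa)
              - log_likelihood(r_home, r_away - eps, y, scale, hfa)) / (2 * eps)
    return r_home + step * g_home, r_away + step * g_away

K, scale, hfa = 20.0, 400.0, 50.0
step = K * scale / math.log(10.0)  # step size that maps the gradient onto K

elo = elo_update(1500.0, 1600.0, 1, K, scale, hfa)
sga = sga_update(1500.0, 1600.0, 1, step, scale, hfa)
print("Elo update:", elo)
print("SGA update:", sga)
```

The two printed pairs should agree up to the finite-difference error. The check covers binary outcomes only; it says nothing about draws or other multilevel scores.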
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reconciles the heuristic feedback-rule view of Elo with its interpretation as online MLE via stochastic gradient ascent, showing exact coincidence in the binary logistic case. It argues that estimation noise necessitates decoupling the ranking model from the prediction model, supplying closed-form corrections (and a data-driven procedure) for the effective scale and home-field parameters; for multilevel scores it offers approximations that incorporate noise and improve fit. The decoupled approach is reported to outperform the conventional reuse of the ranking model on FIFA data and to diagnose non-convergence for most national teams.
Significance. If the closed-form noise corrections are valid, the work supplies a principled, practitioner-accessible improvement to Elo that separates ranking from prediction, yields a convergence diagnostic, and demonstrates measurable gains on real sports data. The explicit reconciliation of the two literatures and the provision of closed-form expressions are genuine strengths.
major comments (2)
- [Abstract and section on closed-form corrections] The central claim that estimation noise admits accurate closed-form corrections to the scale and home-field parameters (derived from the stochastic-gradient model) rests on an implicit noise distribution whose validity is asserted but not rigorously tested beyond the provided examples. Any departure from the modeled form, non-uniform spacing of multilevel scores, or violation of outcome independence would turn the adjustment into a source of bias rather than a correction.
- [FIFA application] The FIFA application concludes that rankings had not converged for the vast majority of teams; this diagnosis depends on the adjusted parameters correctly identifying non-convergence, yet the manuscript supplies neither a formal error analysis of the data-driven identification procedure nor cross-validation against held-out matches.
minor comments (2)
- The semi-tutorial style is helpful, but the numerical examples would benefit from explicit step-by-step derivation of the closed-form expressions rather than final results only.
- Notation for the effective scale parameter versus the original scale parameter should be introduced once and used consistently to avoid reader confusion.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope and limitations of our derivations. We address each major point below and indicate the revisions we will make.
- Referee: [Abstract and section on closed-form corrections] The central claim that estimation noise admits accurate closed-form corrections to the scale and home-field parameters (derived from the stochastic-gradient model) rests on an implicit noise distribution whose validity is asserted but not rigorously tested beyond the provided examples. Any departure from the modeled form, non-uniform spacing of multilevel scores, or violation of outcome independence would turn the adjustment into a source of bias rather than a correction.
  Authors: The closed-form corrections are derived exactly from the stochastic-gradient update rule under the logistic model, where the effective noise distribution is induced by the finite-sample parameter updates rather than posited separately. We agree that the manuscript would benefit from more explicit discussion of the assumptions (outcome independence and the form of the induced noise) and from additional validation. In revision we will add a subsection on the derivation assumptions together with simulation experiments that assess sensitivity to mild violations of independence and non-uniform score spacing. These additions will not change the core closed-form expressions but will make their domain of applicability clearer.
  Revision: partial
- Referee: [FIFA application] The FIFA application concludes that rankings had not converged for the vast majority of teams; this diagnosis depends on the adjusted parameters correctly identifying non-convergence, yet the manuscript supplies neither a formal error analysis of the data-driven identification procedure nor cross-validation against held-out matches.
  Authors: The non-convergence conclusion follows directly from comparing the data-driven estimates of the effective scale and home-field parameters against the values implied by the ranking model. While the procedure itself is fully specified, we acknowledge that a formal error analysis and explicit cross-validation on held-out matches are absent. In the revised manuscript we will include a cross-validation exercise that holds out recent matches, re-estimates the effective parameters on the training window, and checks whether the adjusted model yields improved predictive accuracy on the held-out set; we will also report the variability of the identified convergence status across different training-window lengths.
  Revision: yes
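A hold-out check of the kind promised in the rebuttal could look roughly like the sketch below. It is not the paper's identification procedure: the data are synthetic, the grid search over the scale is a stand-in for whatever estimator the authors use, and every name and number (`RANKING_SCALE`, the noise level of 150 rating points, the 15000/5000 split) is a hypothetical choice made for illustration. Outcomes are drawn from a base-10 logistic model on the true rating difference, while the predictor only sees a noisy version of that difference, mimicking ratings that have not converged. Home-field advantage is omitted for brevity; it could be added as a second grid dimension.

```python
import math
import random

RANKING_SCALE = 400.0  # scale used by the (hypothetical) ranking model

def predict(diff, scale):
    """Base-10 logistic win probability from a rating difference."""
    return 1.0 / (1.0 + 10.0 ** (-diff / scale))

def mean_log_loss(matches, scale):
    """Average negative log-likelihood of binary outcomes under a given scale."""
    total = 0.0
    for diff, y in matches:
        p = predict(diff, scale)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(matches)

def fit_scale(matches, grid):
    """Pick the grid value with the lowest log-loss on the training window."""
    return min(grid, key=lambda s: mean_log_loss(matches, s))

# Synthetic matches: list order stands in for chronological order.
rng = random.Random(0)
matches = []
for _ in range(20000):
    true_diff = rng.gauss(0.0, 200.0)
    y = 1 if rng.random() < predict(true_diff, RANKING_SCALE) else 0
    noisy_diff = true_diff + rng.gauss(0.0, 150.0)  # estimation noise in ratings
    matches.append((noisy_diff, y))

train, held_out = matches[:15000], matches[15000:]  # hold out the most recent 25%
grid = [200.0 + 50.0 * k for k in range(21)]        # candidate scales 200..1200

effective_scale = fit_scale(train, grid)
loss_conventional = mean_log_loss(held_out, RANKING_SCALE)
loss_decoupled = mean_log_loss(held_out, effective_scale)

print("effective scale fitted on training window:", effective_scale)
print("held-out log-loss, ranking scale reused:  ", loss_conventional)
print("held-out log-loss, decoupled scale:       ", loss_decoupled)
```

In this synthetic setup the fitted scale comes out larger than the ranking scale, because noise in the rating differences flattens the observed relationship between difference and outcome. Whether the same pattern holds on real match data is exactly what the proposed cross-validation would have to show.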
Circularity Check
No significant circularity; derivation self-contained via MLE equivalence and noise model
The paper derives the exact coincidence of heuristic Elo and online MLE in the binary logistic case directly from the definitions of stochastic gradient ascent and the logistic expected-score function, without fitting or self-reference. Closed-form noise corrections for scale and home-field advantage follow from the same stochastic-gradient model under explicit noise assumptions, yielding independent adjustments rather than refits. The data-driven identification procedure is presented as an optional supplement, not the load-bearing step for the decoupling claim or the FIFA non-convergence diagnosis. No step reduces by construction to its inputs, no self-citation chain is load-bearing, and the outperformance result is tied to the derived expressions rather than tautological renaming or ansatz smuggling.
Axiom & Free-Parameter Ledger
free parameters (2)
- effective scale parameter
- home-field advantage parameter
axioms (2)
- domain assumption: The expected score follows the logistic function
- ad hoc to paper: Estimation noise can be corrected via closed-form expressions derived from the stochastic gradient model