pith. machine review for the scientific record.

arxiv: 2604.03840 · v1 · submitted 2026-04-04 · 📊 stat.ME · cs.LG

Recognition: no theorem link

New insights into Elo algorithm for practitioners and statisticians

Pith reviewed 2026-05-13 16:56 UTC · model grok-4.3

keywords Elo algorithm · ranking systems · maximum likelihood estimation · stochastic gradient ascent · logistic function · estimation noise · FIFA rankings · convergence diagnostic

The pith

Elo's heuristic and statistical views align exactly only for logistic expected scores, but estimation noise requires decoupling the ranking model from the prediction model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reconciles the common view of the Elo algorithm as a simple feedback rule for updating team ratings with its interpretation as online maximum likelihood estimation performed by stochastic gradient ascent. These two perspectives match exactly when outcomes are binary and the expected score follows the logistic function. Estimation noise in the ratings forces the model optimized for ranking to be adjusted separately when used for predicting outcomes, by changing the effective scale and any home-field advantage parameter. Closed-form corrections are derived for binary cases and approximations for multilevel scores, leading to better predictions than the standard approach that reuses the same model for both tasks. When applied to six years of FIFA men's soccer data, the adjusted method shows that the ranking had not converged for most national teams.
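The coincidence of the two views in the binary logistic case can be illustrated with a minimal Python sketch. It is not the paper's code: the function names, the step size K = 20, and the ratings are hypothetical, and s = 174 is simply the scale value that appears in the figure captions. The sketch compares the heuristic update K·(y − expected score) with one stochastic gradient ascent step on the Bernoulli log-likelihood, where the gradient is taken by finite differences so that no formula is assumed in advance.

```python
import math

def logistic(x):
    """Logistic function L(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def elo_update(theta_home, theta_away, y, K, s):
    """Practitioner's rule: shift ratings by K * (outcome - expected score)."""
    expected = logistic((theta_home - theta_away) / s)
    shift = K * (y - expected)
    return theta_home + shift, theta_away - shift

def log_likelihood(theta_home, theta_away, y, s):
    """Log-likelihood of a binary outcome y under the logistic model."""
    p = logistic((theta_home - theta_away) / s)
    return y * math.log(p) + (1 - y) * math.log(1.0 - p)

def sga_update(theta_home, theta_away, y, step, s, eps=1e-3):
    """Statistician's rule: one gradient ascent step on the log-likelihood.

    The gradient is approximated by central finite differences.
    """
    grad_home = (log_likelihood(theta_home + eps, theta_away, y, s)
                 - log_likelihood(theta_home - eps, theta_away, y, s)) / (2 * eps)
    grad_away = (log_likelihood(theta_home, theta_away + eps, y, s)
                 - log_likelihood(theta_home, theta_away - eps, y, s)) / (2 * eps)
    return theta_home + step * grad_home, theta_away + step * grad_away

# Hypothetical values: K = 20, s = 174, home team rated 1600, away team 1500.
K, s = 20.0, 174.0
elo_result = elo_update(1600.0, 1500.0, 1, K, s)
# The gradient of the log-likelihood is (y - p) / s, so a step of K * s
# reproduces the Elo shift K * (y - p).
sga_result = sga_update(1600.0, 1500.0, 1, K * s, s)
print(elo_result, sga_result)
```

The two printed pairs should agree up to finite-difference error. In this sketch the match depends on the expected score being logistic: with another expected-score function the gradient picks up an extra factor and the heuristic rule is no longer a plain gradient step.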

Core claim

Both the practitioner's heuristic feedback rule and the statistician's online maximum likelihood estimation via stochastic gradient ascent coincide exactly in the binary case if and only if the expected score is the logistic function. Estimation noise forces a principled decoupling between the model used for ranking and the model used for prediction: the effective scale and home-field advantage parameter must be adjusted to account for the noise, with closed-form corrections and a data-driven identification procedure provided. For multilevel outcomes an exact relationship holds when outcome scores are uniformly spaced, but noise-aware approximations are preferred in general because they fit the data better.

What carries the argument

The noise-induced decoupling between the ranking model and the prediction model, implemented through closed-form adjustments to the scale parameter and home-field advantage.
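Why noise in the ratings calls for a larger effective scale can be sketched numerically. The example below does not use the paper's closed-form corrections, which are not reproduced on this page. It swaps in the standard probit-style approximation for a logistic function averaged over Gaussian noise, E[L(x + e)] ≈ L(x / sqrt(1 + π·v/8)) for e ~ N(0, v). The Gaussian noise model, the noise level, and all function names are assumptions made for illustration; s = 174 is the value quoted in the figure captions.

```python
import math

def logistic(x):
    """Logistic function L(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def noisy_expected_score(z, s, noise_std, n=2001, width=8.0):
    """E[L((z + e) / s)] for e ~ N(0, noise_std^2), by trapezoidal integration."""
    lo = -width * noise_std
    h = 2.0 * width * noise_std / (n - 1)
    total = 0.0
    for i in range(n):
        e = lo + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0
        pdf = math.exp(-0.5 * (e / noise_std) ** 2) / (noise_std * math.sqrt(2.0 * math.pi))
        total += weight * pdf * logistic((z + e) / s)
    return total * h

def effective_scale(s, noise_std):
    """Inflated scale from the probit-style approximation (an assumption here)."""
    return s * math.sqrt(1.0 + math.pi * (noise_std / s) ** 2 / 8.0)

# Hypothetical setting: rating difference 348, noise standard deviation 174.
s, noise_std, z = 174.0, 174.0, 348.0
naive = logistic(z / s)                               # reuses the ranking scale
averaged = noisy_expected_score(z, s, noise_std)      # accounts for noisy ratings
adjusted = logistic(z / effective_scale(s, noise_std))
print(naive, averaged, adjusted)
```

Under these assumptions the noise-averaged probability sits closer to 0.5 than the naive one, and the prediction with the inflated scale tracks it much more closely. That is the qualitative effect behind treating the prediction scale as a separate parameter.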

If this is right

  • The decoupled approach yields substantially better predictions than reusing the ranking model directly for prediction.
  • The adjustment procedure acts as a diagnostic that reveals whether rating estimates have converged.
  • Closed-form corrections are available for binary outcomes while approximations handle general multilevel scores.
  • Application to FIFA data indicates that the ranking process had not converged for the vast majority of national teams.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar decoupling may improve predictive accuracy in other heuristic rating systems that rely on online gradient-style updates.
  • The data-driven identification procedure could be applied routinely by practitioners to tune parameters on their own competition data.
  • The convergence diagnostic might be used to decide when to stop updating ratings in ongoing tournaments or leagues.

Load-bearing premise

That the effects of estimation noise can be accurately captured and corrected by the derived closed-form adjustments without introducing new biases.

What would settle it

If the decoupled model's out-of-sample prediction accuracy on held-out match data is no better than the conventional model's, or if the adjusted scale and home-field values differ substantially from those identified directly from the same data, the need for decoupling would be challenged.
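A comparison of this kind could be set up along the following lines. This is a rough sketch, not the paper's identification procedure: it scores predictions by mean log-loss and picks the prediction scale from a grid. The toy rating differences and outcomes, the candidate grid, and the function names are all hypothetical; the form L(z/s + η) for the prediction follows the notation in the figure captions.

```python
import math

def logistic(x):
    """Logistic function L(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def mean_log_loss(rating_diffs, outcomes, scale, eta=0.0):
    """Average negative log-likelihood of binary outcomes under L(z / scale + eta)."""
    total = 0.0
    for z, y in zip(rating_diffs, outcomes):
        p = logistic(z / scale + eta)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(outcomes)

def identify_scale(rating_diffs, outcomes, candidate_scales, eta=0.0):
    """Grid search: return the candidate scale with the lowest mean log-loss."""
    return min(candidate_scales,
               key=lambda scale: mean_log_loss(rating_diffs, outcomes, scale, eta))

# Hypothetical toy data: rating differences (home minus away) and binary outcomes.
train_diffs = [250.0, -120.0, 60.0, 300.0, -200.0, 90.0, -40.0, 150.0]
train_outcomes = [1, 0, 0, 1, 1, 1, 0, 0]

ranking_scale = 174.0
candidates = [ranking_scale * m for m in (0.5, 1.0, 1.5, 2.0, 3.0, 4.0)]
best_scale = identify_scale(train_diffs, train_outcomes, candidates)
print(best_scale,
      mean_log_loss(train_diffs, train_outcomes, ranking_scale),
      mean_log_loss(train_diffs, train_outcomes, best_scale))
```

For the test described above, the scale would be identified on a training window only, and both the ranking scale and the identified scale would then be scored with `mean_log_loss` on matches that neither the ratings nor the grid search have seen.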

Figures

Figures reproduced from arXiv: 2604.03840 by Leszek Szczecinski.

Figure 1: Trajectories of the estimated skills obtained using the Elo algorithm
Figure 2: Conditional probability functions P_y(z/s + η) (97) defining the AC model with α and δ given in (101) and (102), s = 174, and η = 0.8. The solid thick line denotes the expected value of the score, G(z/s + η), given in (104), and the solid dashed line denotes the approximation of the latter using a canonical function L(z/s̃ + η̃) with s̃ and η̃ in (113) and (58). Examples of the function P_y(z) are shown in …
Figure 3: Comparison between G(z/s) and its approximation L(z/s̃), for L = 3, s̃ = s·β_AC→L, with β_AC→L given in (115), δ = [0, 0.5, 1], α = [0, α_1, 0], where the values of α_1 are given in the legend; s = 174. For smaller values of z, the curves practically superimpose. For α_1 = log 2 ≈ 0.7, we have a true equivalence of the expected scores, i.e., G(z/s) = L(z/s̃), where s̃ = 2s. Motivated by analysis which follows (110), …
Figure 4: Skills (left axis) of the team, from the best to the worst, (thin …
Figure 5: Percentage of international FIFA teams which have played at least …
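The exact equivalence quoted in the Figure 3 caption (α_1 = log 2 gives s̃ = 2s) can be checked numerically. The sketch below assumes that the AC model assigns probabilities P_y(x) ∝ exp(α_y + δ_y·x) with x = z/s and that δ_y also serves as the score value of outcome y. That form is an assumption, since the model's definition (97) is not reproduced on this page, and the function names are hypothetical.

```python
import math

def logistic(x):
    """Logistic function L(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def ac_expected_score(x, alpha, delta):
    """Expected score under the assumed form P_y(x) proportional to exp(alpha_y + delta_y * x)."""
    weights = [math.exp(a + d * x) for a, d in zip(alpha, delta)]
    return sum(d * w for d, w in zip(delta, weights)) / sum(weights)

# Three outcome levels with uniformly spaced scores, as in the caption.
alpha = [0.0, math.log(2.0), 0.0]
delta = [0.0, 0.5, 1.0]

# Compare G(x) with L(x / 2) on a grid of x = z / s values.
grid = [i / 10.0 for i in range(-60, 61)]
max_gap = max(abs(ac_expected_score(x, alpha, delta) - logistic(x / 2.0)) for x in grid)
print(max_gap)
```

Under the assumed form the gap is zero up to rounding, because the normalizing sum 1 + 2·exp(x/2) + exp(x) factors as (1 + exp(x/2))², which leaves exp(x/2) / (1 + exp(x/2)) = L(x/2), that is, L(z/(2s)).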
original abstract

This work reconciles two perspectives on the Elo ranking that coexist in the literature: the practitioner's view as a heuristic feedback rule, and the statistician's view as online maximum likelihood estimation via stochastic gradient ascent. Both perspectives coincide exactly in the binary case (iff the expected score is the logistic function). However, estimation noise forces a principled decoupling between the model used for ranking and the model used for prediction: the effective scale and home-field advantage parameter must be adjusted to account for the noise. We provide both closed-form corrections and a data-driven identification procedure. For multilevel outcomes, an exact relationship exists when outcome scores are uniformly spaced, but approximations are preferred in general: they account for estimation noise and better fit the data. The decoupled approach substantially outperforms the conventional one that reuses the ranking model for prediction, and serves as a diagnostic of convergence status. Applied to six years of FIFA men's ranking, we find that the ranking had not converged for the vast majority of national teams. The paper is written in a semi-tutorial style accessible to practitioners, with all key results accompanied by closed-form expressions and numerical examples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reconciles the heuristic feedback-rule view of Elo with its interpretation as online MLE via stochastic gradient ascent, showing exact coincidence in the binary logistic case. It argues that estimation noise necessitates decoupling the ranking model from the prediction model, supplying closed-form corrections (and a data-driven procedure) for the effective scale and home-field parameters; for multilevel scores it offers approximations that incorporate noise and improve fit. The decoupled approach is reported to outperform the conventional reuse of the ranking model on FIFA data and to diagnose non-convergence for most national teams.

Significance. If the closed-form noise corrections are valid, the work supplies a principled, practitioner-accessible improvement to Elo that separates ranking from prediction, yields a convergence diagnostic, and demonstrates measurable gains on real sports data. The explicit reconciliation of the two literatures and the provision of closed-form expressions are genuine strengths.

major comments (2)
  1. [Abstract and section on closed-form corrections] The central claim that estimation noise admits accurate closed-form corrections to the scale and home-field parameters (derived from the stochastic-gradient model) rests on an implicit noise distribution whose validity is asserted but not rigorously tested beyond the provided examples. Any departure from the modeled form, non-uniform spacing of multilevel scores, or violation of outcome independence would turn the adjustment into a source of bias rather than a correction.
  2. [FIFA application] The FIFA application concludes that rankings had not converged for the vast majority of teams; this diagnosis depends on the adjusted parameters correctly identifying non-convergence, yet the manuscript supplies neither a formal error analysis of the data-driven identification procedure nor cross-validation against held-out matches.
minor comments (2)
  1. The semi-tutorial style is helpful, but the numerical examples would benefit from explicit step-by-step derivation of the closed-form expressions rather than final results only.
  2. Notation for the effective scale parameter versus the original scale parameter should be introduced once and used consistently to avoid reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and limitations of our derivations. We address each major point below and indicate the revisions we will make.

point-by-point responses
  1. Referee: [Abstract and section on closed-form corrections] The central claim that estimation noise admits accurate closed-form corrections to the scale and home-field parameters (derived from the stochastic-gradient model) rests on an implicit noise distribution whose validity is asserted but not rigorously tested beyond the provided examples. Any departure from the modeled form, non-uniform spacing of multilevel scores, or violation of outcome independence would turn the adjustment into a source of bias rather than a correction.

    Authors: The closed-form corrections are derived exactly from the stochastic-gradient update rule under the logistic model, where the effective noise distribution is induced by the finite-sample parameter updates rather than posited separately. We agree that the manuscript would benefit from more explicit discussion of the assumptions (outcome independence and the form of the induced noise) and from additional validation. In revision we will add a subsection on the derivation assumptions together with simulation experiments that assess sensitivity to mild violations of independence and non-uniform score spacing. These additions will not change the core closed-form expressions but will make their domain of applicability clearer. revision: partial

  2. Referee: [FIFA application] The FIFA application concludes that rankings had not converged for the vast majority of teams; this diagnosis depends on the adjusted parameters correctly identifying non-convergence, yet the manuscript supplies neither a formal error analysis of the data-driven identification procedure nor cross-validation against held-out matches.

    Authors: The non-convergence conclusion follows directly from comparing the data-driven estimates of the effective scale and home-field parameters against the values implied by the ranking model. While the procedure itself is fully specified, we acknowledge that a formal error analysis and explicit cross-validation on held-out matches are absent. In the revised manuscript we will include a cross-validation exercise that holds out recent matches, re-estimates the effective parameters on the training window, and checks whether the adjusted model yields improved predictive accuracy on the held-out set; we will also report the variability of the identified convergence status across different training-window lengths. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via MLE equivalence and noise model

full rationale

The paper derives the exact coincidence of heuristic Elo and online MLE in the binary logistic case directly from the definitions of stochastic gradient ascent and the logistic expected-score function, without fitting or self-reference. Closed-form noise corrections for scale and home-field advantage follow from the same stochastic-gradient model under explicit noise assumptions, yielding independent adjustments rather than refits. The data-driven identification procedure is presented as an optional supplement, not the load-bearing step for the decoupling claim or the FIFA non-convergence diagnosis. No step reduces by construction to its inputs, no self-citation chain is load-bearing, and the outperformance result is tied to the derived expressions rather than tautological renaming or ansatz smuggling.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the logistic function as a domain assumption for exact coincidence and introduces adjusted scale and home-field parameters as free parameters to correct for estimation noise derived from the MLE framework.

free parameters (2)
  • effective scale parameter
    Adjusted to account for estimation noise when using the model for prediction rather than ranking
  • home-field advantage parameter
    Adjusted to account for estimation noise in the decoupled prediction model
axioms (2)
  • domain assumption: The expected score follows the logistic function
    Required for the exact coincidence between heuristic and MLE perspectives in the binary case
  • ad hoc to paper: Estimation noise can be corrected via closed-form expressions derived from the stochastic gradient model
    Central to the proposed decoupling between ranking and prediction models

