arxiv: 2604.03840 · v1 · submitted 2026-04-04 · 📊 stat.ME · cs.LG

New insights into Elo algorithm for practitioners and statisticians

Leszek Szczecinski

Pith reviewed 2026-05-13 16:56 UTC · model grok-4.3

classification 📊 stat.ME cs.LG

keywords Elo algorithm · ranking systems · maximum likelihood estimation · stochastic gradient ascent · logistic function · estimation noise · FIFA rankings · convergence diagnostic

The pith

Elo's heuristic and statistical views align exactly only for logistic expected scores, but estimation noise requires decoupling the ranking model from the prediction model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reconciles the common view of the Elo algorithm as a simple feedback rule for updating team ratings with its interpretation as online maximum likelihood estimation performed by stochastic gradient ascent. These two perspectives match exactly when outcomes are binary and the expected score follows the logistic function. Estimation noise in the ratings forces the model optimized for ranking to be adjusted separately when used for predicting outcomes, by changing the effective scale and any home-field advantage parameter. Closed-form corrections are derived for binary cases and approximations for multilevel scores, leading to better predictions than the standard approach that reuses the same model for both tasks. When applied to six years of FIFA men's soccer data, the adjusted method shows that the ranking had not converged for most national teams.
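The coincidence between the two views can be checked numerically. The sketch below is a minimal illustration, not the paper's code: it compares one Elo update with one stochastic gradient ascent step on the binary logistic log-likelihood. The function names, the ratings, the step size K = 20, and the scale s = 400/ln 10 (about 174) are assumptions chosen for illustration, and the gradient is taken by finite differences rather than from a formula.

```python
import math


def logistic(x):
    """Logistic expected-score function L(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def log_likelihood(theta_a, theta_b, y, s):
    """Log-likelihood of a binary outcome y (1 = A wins, 0 = B wins)."""
    p = logistic((theta_a - theta_b) / s)
    return y * math.log(p) + (1 - y) * math.log(1 - p)


def elo_update(theta_a, theta_b, y, s, k):
    """Heuristic feedback rule: move both ratings by K times the surprise."""
    surprise = y - logistic((theta_a - theta_b) / s)
    return theta_a + k * surprise, theta_b - k * surprise


def sga_update(theta_a, theta_b, y, s, mu, h=1e-3):
    """One gradient-ascent step; gradient by central finite differences."""
    grad_a = (log_likelihood(theta_a + h, theta_b, y, s)
              - log_likelihood(theta_a - h, theta_b, y, s)) / (2 * h)
    grad_b = (log_likelihood(theta_a, theta_b + h, y, s)
              - log_likelihood(theta_a, theta_b - h, y, s)) / (2 * h)
    return theta_a + mu * grad_a, theta_b + mu * grad_b


S = 400 / math.log(10)  # assumed scale, about 174
K = 20.0                # assumed Elo step size

elo = elo_update(1600.0, 1500.0, 1, S, K)
sga = sga_update(1600.0, 1500.0, 1, S, mu=K * S)  # gradient step mu = K * s
print(elo, sga)
```

With the step set to K·s the two updates agree up to finite-difference error, because the derivative of the log-likelihood with respect to a rating is (y − L(z/s))/s.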

Core claim

Both the practitioner's heuristic feedback rule and the statistician's online maximum likelihood estimation via stochastic gradient ascent coincide exactly in the binary case if and only if the expected score is the logistic function. Estimation noise forces a principled decoupling between the model used for ranking and the model used for prediction: the effective scale and home-field advantage parameter must be adjusted to account for the noise, with closed-form corrections and a data-driven identification procedure provided. For multilevel outcomes an exact relationship holds when outcome scores are uniformly spaced, but noise-aware approximations are preferred in general because they fit the data better.

What carries the argument

The noise-induced decoupling between the ranking model and the prediction model, implemented through closed-form adjustments to the scale parameter and home-field advantage.
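One generic way to see why noise changes the effective scale is to average the logistic curve over an error in the rating difference. The sketch below assumes a zero-mean Gaussian error with a hypothetical standard deviation `sigma`, and compares the averaged curve with a logistic of enlarged scale taken from the textbook probit-style approximation of a logistic-Gaussian average. It illustrates the mechanism only: it is not the paper's closed-form correction, which this page does not reproduce, and all names and numbers are assumptions.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss


def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))


def averaged_score(z_hat, s, sigma, n_nodes=40):
    """E[L((z_hat + eps) / s)] for eps ~ N(0, sigma^2), by Gauss-Hermite quadrature."""
    nodes, weights = hermgauss(n_nodes)
    eps = np.sqrt(2.0) * sigma * nodes
    return float(np.sum(weights * logistic((z_hat + eps) / s)) / np.sqrt(np.pi))


def approx_effective_scale(s, sigma):
    """Probit-style approximation: the noisy average behaves like L(z / s_eff)."""
    return float(s * np.sqrt(1.0 + np.pi * sigma**2 / (8.0 * s**2)))


s = 174.0       # ranking scale (value quoted in the figure captions)
sigma = 100.0   # hypothetical std of the rating-difference error
z_hat = 200.0   # hypothetical estimated rating difference

naive = float(logistic(z_hat / s))           # reuse the ranking model as is
averaged = averaged_score(z_hat, s, sigma)   # account for the noise
s_eff = approx_effective_scale(s, sigma)
rescaled = float(logistic(z_hat / s_eff))    # logistic with enlarged scale
print(naive, averaged, rescaled, s_eff)
```

Under these assumptions the averaged prediction for a favourite is less extreme than the naive one, and a logistic with a larger scale tracks it closely.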

If this is right

The decoupled approach yields substantially better predictions than reusing the ranking model directly for prediction.
The adjustment procedure acts as a diagnostic that reveals whether rating estimates have converged.
Closed-form corrections are available for binary outcomes while approximations handle general multilevel scores.
Application to FIFA data indicates that the ranking process had not converged for the vast majority of national teams.
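The identification and diagnostic steps can be sketched as a grid search: pick the scale and home-field offset that minimise the log-loss of L(z/s + η) on observed rating differences and outcomes, then compare them with the values used for ranking. The function names, grids, and synthetic data below are hypothetical, and the paper's own identification procedure may differ. Expected scores are used as soft outcomes so that the example is deterministic.

```python
import numpy as np


def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))


def log_loss(z, y, s, eta):
    """Mean cross-entropy of predictions L(z/s + eta) against outcomes y in [0, 1]."""
    p = np.clip(logistic(z / s + eta), 1e-12, 1.0 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def identify(z, y, scale_grid, eta_grid):
    """Return (scale, eta, loss) minimising the log-loss over the grid."""
    candidates = ((log_loss(z, y, s, eta), s, eta)
                  for s in scale_grid for eta in eta_grid)
    loss, s_best, eta_best = min(candidates)
    return float(s_best), float(eta_best), loss


scale_grid = np.linspace(150.0, 300.0, 16)   # hypothetical search grid
eta_grid = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]    # hypothetical search grid
s_true, eta_true = float(scale_grid[7]), eta_grid[3]  # data-generating values

z = np.linspace(-400.0, 400.0, 81)           # synthetic rating differences
y = logistic(z / s_true + eta_true)          # soft outcomes (expected scores)

s_rank = 174.0                               # scale used by the ranking model
s_hat, eta_hat, loss = identify(z, y, scale_grid, eta_grid)
print(s_hat, eta_hat, s_hat / s_rank)
```

A ratio `s_hat / s_rank` far from the value expected for converged ratings is the kind of gap the diagnostic looks for; on real data `y` would hold the observed match outcomes.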

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar decoupling may improve predictive accuracy in other heuristic rating systems that rely on online gradient-style updates.
The data-driven identification procedure could be applied routinely by practitioners to tune parameters on their own competition data.
The convergence diagnostic might be used to decide when to stop updating ratings in ongoing tournaments or leagues.

Load-bearing premise

That the effects of estimation noise can be accurately captured and corrected by the derived closed-form adjustments without introducing new biases.

What would settle it

If the decoupled model's out-of-sample prediction accuracy on held-out match data is no better than the conventional model's, or if the adjusted scale and home-field values differ substantially from those identified directly from the same data, the need for decoupling would be challenged.

Figures

Figures reproduced from arXiv: 2604.03840 by Leszek Szczecinski.

**Figure 2.** Conditional probability functions $P_y(z/s + \eta)$ (97) defining the AC model with $\alpha$ and $\delta$ given in (101) and (102), $s = 174$, and $\eta = 0.8$. The solid thick line denotes the expected value of the score, $G(z/s + \eta)$, given in (104) and solid dashed line denotes the approximation of the latter using a canonical function $L(z/\tilde{s} + \tilde{\eta})$ with $\tilde{s}$ and $\tilde{\eta}$ in (113) and (58). Examples of the function $P_y(z)$ are shown in …

**Figure 3.** Comparison between $G(z/s)$ and its approximation $L(z/\tilde{s})$, for $L = 3$, $\tilde{s} = s\beta_{AC \to L}$, $\beta_{AC \to L}$ given in (115), $\delta = [0, 0.5, 1]$, $\alpha = [0, \alpha_1, 0]$, where the values of $\alpha_1$ are given in the legend; $s = 174$. For smaller values of $z$, the curves practically superimpose. For $\alpha_1 = \log 2 \approx 0.7$, we have a true equivalence of the expected scores, i.e., $G(z/s) = L(z/\tilde{s})$, where $\tilde{s} = 2s$. Motivated by analysis which follows (110), …

**Figure 4.** Skills (left axis) of the team, from the best to the worst, (thin …

**Figure 5.** Percentage of international FIFA teams which have played at least …
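The exact equivalence stated in the Figure 3 caption can be checked numerically under an assumption about the model's form. The sketch below assumes that the AC model assigns P_y(z) proportional to exp(α_y + δ_y z) and that G is the δ-weighted mean of those probabilities. The paper's definitions in its equations (97) and (104) are not reproduced on this page, so this form and the function names are assumptions. Under it, setting α_1 = log 2 makes the numerator and denominator of G factor so that G(z) = L(z/2), which matches the caption's scale of 2s.

```python
import numpy as np


def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))


def ac_expected_score(x, alpha, delta):
    """Expected score sum_y delta_y * P_y(x), with P_y(x) ~ exp(alpha_y + delta_y * x)."""
    logits = alpha[None, :] + np.outer(x, delta)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs @ delta


s = 174.0
delta = np.array([0.0, 0.5, 1.0])            # loss, draw, win scores
alpha = np.array([0.0, np.log(2.0), 0.0])    # alpha_1 = log 2
z = np.linspace(-800.0, 800.0, 161)          # rating differences

g = ac_expected_score(z / s, alpha, delta)   # G(z/s)
l_rescaled = logistic(z / (2.0 * s))         # L(z / (2s))
max_gap = float(np.max(np.abs(g - l_rescaled)))
print(max_gap)
```

For other values of α_1 the two curves no longer coincide, which is where the caption's approximation comes in.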

Original abstract

This work reconciles two perspectives on the Elo ranking that coexist in the literature: the practitioner's view as a heuristic feedback rule, and the statistician's view as online maximum likelihood estimation via stochastic gradient ascent. Both perspectives coincide exactly in the binary case (iff the expected score is the logistic function). However, estimation noise forces a principled decoupling between the model used for ranking and the model used for prediction: the effective scale and home-field advantage parameter must be adjusted to account for the noise. We provide both closed-form corrections and a data-driven identification procedure. For multilevel outcomes, an exact relationship exists when outcome scores are uniformly spaced, but approximations are preferred in general: they account for estimation noise and better fit the data. The decoupled approach substantially outperforms the conventional one that reuses the ranking model for prediction, and serves as a diagnostic of convergence status. Applied to six years of FIFA men's ranking, we find that the ranking had not converged for the vast majority of national teams. The paper is written in a semi-tutorial style accessible to practitioners, with all key results accompanied by closed-form expressions and numerical examples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Elo's heuristic and MLE views match exactly for binary outcomes but require closed-form noise corrections to scale and home-field for accurate prediction, with a FIFA application showing non-convergence.

The paper's core result is that the standard Elo update coincides exactly with online MLE when outcomes are binary and the expected score follows the logistic, but estimation noise requires decoupling the ranking model from the prediction model via adjusted scale and home-field parameters. Closed-form corrections and a data-driven identification method are supplied for this adjustment, along with approximations for multilevel scores that account for noise and fit data better than reusing the ranking parameters directly. Applied to FIFA men's rankings over six years, the adjusted version indicates most teams had not converged and outperforms the conventional approach on prediction tasks. The semi-tutorial presentation with explicit expressions and examples makes the material usable for practitioners. The derivations appear clean and the binary-case coincidence is a useful clarification. The main limitation is that the corrections rest on a specific noise model tied to the logistic or Gaussian form, plus independence across matches; deviations from uniform spacing in multilevel outcomes push the work to approximations that could introduce bias if those assumptions fail. The FIFA outperformance is presented as evidence but would benefit from more detail on sensitivity and error analysis to confirm robustness. This is useful for anyone maintaining or analyzing rating systems in sports, esports, or ML who wants a principled tweak rather than ad-hoc fixes. It deserves peer review because the math is explicit, the empirical claim is testable, and the practical payoff is clear.

Referee Report

2 major / 2 minor

Summary. The paper reconciles the heuristic feedback-rule view of Elo with its interpretation as online MLE via stochastic gradient ascent, showing exact coincidence in the binary logistic case. It argues that estimation noise necessitates decoupling the ranking model from the prediction model, supplying closed-form corrections (and a data-driven procedure) for the effective scale and home-field parameters; for multilevel scores it offers approximations that incorporate noise and improve fit. The decoupled approach is reported to outperform the conventional reuse of the ranking model on FIFA data and to diagnose non-convergence for most national teams.

Significance. If the closed-form noise corrections are valid, the work supplies a principled, practitioner-accessible improvement to Elo that separates ranking from prediction, yields a convergence diagnostic, and demonstrates measurable gains on real sports data. The explicit reconciliation of the two literatures and the provision of closed-form expressions are genuine strengths.

major comments (2)

[Abstract and section on closed-form corrections] The central claim that estimation noise admits accurate closed-form corrections to the scale and home-field parameters (derived from the stochastic-gradient model) rests on an implicit noise distribution whose validity is asserted but not rigorously tested beyond the provided examples. Any departure from the modeled form, non-uniform spacing of multilevel scores, or violation of outcome independence would turn the adjustment into a source of bias rather than a correction.
[FIFA application] The FIFA application concludes that rankings had not converged for the vast majority of teams; this diagnosis depends on the adjusted parameters correctly identifying non-convergence, yet the manuscript supplies neither a formal error analysis of the data-driven identification procedure nor cross-validation against held-out matches.

minor comments (2)

The semi-tutorial style is helpful, but the numerical examples would benefit from explicit step-by-step derivation of the closed-form expressions rather than final results only.
Notation for the effective scale parameter versus the original scale parameter should be introduced once and used consistently to avoid reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and limitations of our derivations. We address each major point below and indicate the revisions we will make.

Referee: [Abstract and section on closed-form corrections] The central claim that estimation noise admits accurate closed-form corrections to the scale and home-field parameters (derived from the stochastic-gradient model) rests on an implicit noise distribution whose validity is asserted but not rigorously tested beyond the provided examples. Any departure from the modeled form, non-uniform spacing of multilevel scores, or violation of outcome independence would turn the adjustment into a source of bias rather than a correction.

Authors: The closed-form corrections are derived exactly from the stochastic-gradient update rule under the logistic model, where the effective noise distribution is induced by the finite-sample parameter updates rather than posited separately. We agree that the manuscript would benefit from more explicit discussion of the assumptions (outcome independence and the form of the induced noise) and from additional validation. In revision we will add a subsection on the derivation assumptions together with simulation experiments that assess sensitivity to mild violations of independence and non-uniform score spacing. These additions will not change the core closed-form expressions but will make their domain of applicability clearer.

Revision: partial

Referee: [FIFA application] The FIFA application concludes that rankings had not converged for the vast majority of teams; this diagnosis depends on the adjusted parameters correctly identifying non-convergence, yet the manuscript supplies neither a formal error analysis of the data-driven identification procedure nor cross-validation against held-out matches.

Authors: The non-convergence conclusion follows directly from comparing the data-driven estimates of the effective scale and home-field parameters against the values implied by the ranking model. While the procedure itself is fully specified, we acknowledge that a formal error analysis and explicit cross-validation on held-out matches are absent. In the revised manuscript we will include a cross-validation exercise that holds out recent matches, re-estimates the effective parameters on the training window, and checks whether the adjusted model yields improved predictive accuracy on the held-out set; we will also report the variability of the identified convergence status across different training-window lengths.

Revision: yes
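The cross-validation exercise described in this response could look like the sketch below: split the matches in time order, refit the effective scale on the training window, compare held-out log-loss against the ranking scale, and repeat for several window lengths. Everything here is hypothetical, including the function names, the grid, the synthetic rating differences, and the omission of the home-field term for brevity. Soft outcomes generated from a known scale keep the run deterministic.

```python
import numpy as np


def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))


def log_loss(z, y, s):
    """Mean cross-entropy of predictions L(z/s) against outcomes y in [0, 1]."""
    p = np.clip(logistic(z / s), 1e-12, 1.0 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def holdout_check(z, y, s_rank, scale_grid, train_frac):
    """Fit the scale on the first train_frac of matches, score on the rest."""
    n_train = int(len(z) * train_frac)
    z_tr, y_tr = z[:n_train], y[:n_train]
    z_te, y_te = z[n_train:], y[n_train:]
    s_eff = float(min(scale_grid, key=lambda s: log_loss(z_tr, y_tr, s)))
    return {
        "train_frac": train_frac,
        "s_eff": s_eff,
        "loss_conventional": log_loss(z_te, y_te, s_rank),  # reuse ranking scale
        "loss_decoupled": log_loss(z_te, y_te, s_eff),      # refitted scale
    }


scale_grid = np.linspace(150.0, 300.0, 16)  # hypothetical search grid
s_rank = 174.0                              # scale used by the ranking model
s_true = float(scale_grid[9])               # hypothetical data-generating scale

z = 300.0 * np.sin(np.arange(200))          # synthetic, time-ordered differences
y = logistic(z / s_true)                    # soft outcomes (expected scores)

results = [holdout_check(z, y, s_rank, scale_grid, f) for f in (0.5, 0.7, 0.9)]
for r in results:
    print(r)
```

On real data the spread of `s_eff` across training-window lengths would give a rough picture of how stable the identified parameters are.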

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via MLE equivalence and noise model

The paper derives the exact coincidence of heuristic Elo and online MLE in the binary logistic case directly from the definitions of stochastic gradient ascent and the logistic expected-score function, without fitting or self-reference. Closed-form noise corrections for scale and home-field advantage follow from the same stochastic-gradient model under explicit noise assumptions, yielding independent adjustments rather than refits. The data-driven identification procedure is presented as an optional supplement, not the load-bearing step for the decoupling claim or the FIFA non-convergence diagnosis. No step reduces by construction to its inputs, no self-citation chain is load-bearing, and the outperformance result is tied to the derived expressions rather than tautological renaming or ansatz smuggling.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the logistic function as a domain assumption for exact coincidence and introduces adjusted scale and home-field parameters as free parameters to correct for estimation noise derived from the MLE framework.

free parameters (2)

effective scale parameter: adjusted to account for estimation noise when using the model for prediction rather than ranking
home-field advantage parameter: adjusted to account for estimation noise in the decoupled prediction model

axioms (2)

domain assumption: The expected score follows the logistic function. Required for the exact coincidence between heuristic and MLE perspectives in the binary case.
ad hoc to paper: Estimation noise can be corrected via closed-form expressions derived from the stochastic gradient model. Central to the proposed decoupling between ranking and prediction models.
