pith. machine review for the scientific record.

arxiv: 2604.05460 · v1 · submitted 2026-04-07 · 📊 stat.ME · cs.AI

Recognition: no theorem link

LLM Evaluation as Tensor Completion: Low Rank Structure and Semiparametric Efficiency

Jiachun Li, David Simchi-Levi, Will Wei Sun

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:35 UTC · model grok-4.3

classification 📊 stat.ME cs.AI
keywords LLM evaluation · tensor completion · semiparametric inference · low-rank tensor · Bradley-Terry-Luce model · efficient influence function · score whitening · pairwise comparisons

The pith

Pairwise LLM judgments modeled as low-rank tensor observations admit semiparametric efficient estimators for ability gaps and win probabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames LLM evaluation data as sparse pairwise comparisons generated from a low-rank latent score tensor under Bradley-Terry-Luce models. It derives the information operator restricted to the low-rank tangent space, obtains the efficient influence function, and establishes the semiparametric efficiency bound for smooth functionals of the tensor. A one-step debiased estimator is constructed that attains asymptotic normality. The central device is a score-whitening transformation that equalizes local Fisher information to restore stable inference despite the anisotropic information operator and non-uniform sampling.
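The observation model described above can be sketched in a few lines (an editorial illustration, not the paper's code: variable names are ours, and a low-rank matrix stands in for the paper's tensor). A low-rank score matrix generates sparse, non-uniformly sampled Bradley-Terry-Luce comparisons:

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_tasks, rank = 8, 20, 2

# Low-rank latent scores: T_star[i, t] = ability of model i on task t.
U = rng.normal(size=(n_models, rank))
V = rng.normal(size=(n_tasks, rank))
T_star = U @ V.T

def btl_win_prob(T, i, j, t):
    """P(model i beats model j on task t) under Bradley-Terry-Luce."""
    return 1.0 / (1.0 + np.exp(-(T[i, t] - T[j, t])))

# Sparse, non-uniform sampling: popular models are compared more often,
# mimicking real evaluation platforms.
popularity = rng.dirichlet(np.ones(n_models))
comparisons = []
for _ in range(500):
    i, j = rng.choice(n_models, size=2, replace=False, p=popularity)
    t = rng.integers(n_tasks)
    y = rng.random() < btl_win_prob(T_star, i, j, t)
    comparisons.append((i, j, t, int(y)))
```

The non-uniform `popularity` weights are what make the resulting information operator anisotropic, which is the bottleneck the paper's score-whitening step addresses.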

Core claim

For a low-rank latent score tensor observed through sparse pairwise comparisons under Bradley-Terry-Luce-type models, the semiparametric efficiency bound for any smooth functional can be attained by a one-step estimator built from the efficient influence function on the low-rank tangent space, once a score-whitening step compensates for the fact that the information operator does not commute with the tangent-space projection.

What carries the argument

efficient influence function on the low-rank tangent space of the score tensor, together with the score-whitening transformation that equalizes anisotropic Fisher information
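What "equalizes anisotropic Fisher information" means is easiest to see in the simplest possible case (an editorial sketch with our own notation; the paper's whitening operator acts on tensor-valued scores): if scores carry anisotropic information matrix W, transforming them by W^(-1/2) yields identity covariance, i.e. equal information in every direction.

```python
import numpy as np

rng = np.random.default_rng(2)
W = np.array([[4.0, 1.0], [1.0, 2.0]])   # anisotropic information matrix

# Draw raw scores whose covariance is W.
chol = np.linalg.cholesky(W)
scores = rng.normal(size=(200_000, 2)) @ chol.T

# Whitening transform W^{-1/2}, built from the eigendecomposition of W.
vals, vecs = np.linalg.eigh(W)
W_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
whitened = scores @ W_inv_sqrt.T

print(np.cov(whitened.T).round(2))   # ≈ identity: information equalized
```

The paper's contribution is doing this on the low-rank tangent space, where W and the projection interact; the scalar intuition above is only the starting point.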

If this is right

  • Ability gaps and win probabilities between models admit asymptotically normal estimators with valid confidence intervals.
  • The procedure remains valid under the sparse and non-uniform sampling patterns typical of real LLM evaluation platforms.
  • Uncertainty quantification becomes feasible for leaderboard rankings without assuming uniform observation probabilities.
  • The same efficiency framework applies to any pairwise comparison data whose latent scores admit low-rank structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could supply confidence intervals for rankings in sports or product recommendation settings that rely on pairwise outcomes.
  • LLM platforms could replace point-estimate leaderboards with intervals that reflect the actual information in sparse human judgments.
  • Empirical checks on large judgment datasets could verify whether real score tensors are close enough to low-rank for the efficiency gains to materialize.

Load-bearing premise

The latent score tensor has low-rank structure and the pairwise comparisons follow Bradley-Terry-Luce-type models with sparse non-uniform observations.

What would settle it

On data simulated from a known low-rank tensor under the Bradley-Terry-Luce model, the empirical variance of the one-step estimator would have to match the derived semiparametric efficiency bound at large sample sizes; persistent mismatch would falsify the efficiency claim.
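A toy version of that test can be run in the scalar case (editorial sketch, not the paper's experiment): for a single ability gap theta between two models, the closed-form BTL maximum-likelihood estimate is the logit of the empirical win rate, and its Monte Carlo variance should match the inverse Fisher information 1 / (N·σ(θ)·(1−σ(θ))) at large N.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.8                               # true ability gap
sigma = 1.0 / (1.0 + np.exp(-theta))      # BTL win probability
N, n_reps = 2000, 2000

estimates = []
for _ in range(n_reps):
    wins = rng.random(N) < sigma                      # N pairwise outcomes
    p_hat = wins.mean()
    estimates.append(np.log(p_hat / (1.0 - p_hat)))   # closed-form BTL MLE

emp_var = np.var(estimates)
fisher_bound = 1.0 / (N * sigma * (1.0 - sigma))
print(emp_var / fisher_bound)   # should be near 1, up to Monte Carlo error
```

The paper's claim is the tensor analogue: the one-step estimator's variance should track the semiparametric bound even under low-rank structure and non-uniform sampling, where no closed form exists.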

read the original abstract

Large language model (LLM) evaluation platforms increasingly rely on pairwise human judgments. These data are noisy, sparse, and non-uniform, yet leaderboards are reported with limited uncertainty quantification. We study this as semiparametric inference for a low-rank latent score tensor observed through pairwise comparisons under Bradley-Terry-Luce-type models. This places LLM evaluation in a new tensor completion setting with structured observations, non-uniform sampling, and pairwise contrasts. Our target is a smooth functional $\psi(T^\star)$, including linear estimands such as ability gaps and nonlinear ones such as win probabilities. We derive the information operator on the low-rank tangent space, the efficient influence function, and the semiparametric efficiency bound, then construct a one-step debiased estimator with asymptotic normality. A central challenge is that the information operator is anisotropic and does not commute with the tangent-space projection, creating a bottleneck absent from isotropic models. We introduce a score-whitening method that equalizes local Fisher information and restores stable inference at the optimal sample-complexity scale. Our results provide a principled framework for uncertainty quantification in LLM evaluation and more broadly for inference on low-rank structures from pairwise data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript frames LLM evaluation from noisy, sparse, non-uniform pairwise human judgments as semiparametric inference on a low-rank latent score tensor under Bradley-Terry-Luce-type models. It derives the information operator restricted to the low-rank tangent space, the efficient influence function, and the semiparametric efficiency bound for smooth functionals such as ability gaps and win probabilities. A one-step debiased estimator is constructed and shown to achieve asymptotic normality; a score-whitening procedure is introduced to equalize local Fisher information and restore stable inference under the anisotropic information operator that arises from non-uniform sampling.

Significance. If the derivations hold, the work supplies the first rigorous efficiency theory and uncertainty quantification for LLM leaderboards, replacing ad-hoc ranking with statistically grounded inference. The low-rank tensor completion setting with structured pairwise observations is novel, and the score-whitening technique addresses a genuine technical obstacle (non-commuting anisotropic operator and tangent-space projection) that is absent from isotropic models. The framework is extensible to other pairwise-data problems and supplies falsifiable asymptotic predictions once the estimator is implemented.

major comments (2)
  1. [Section 4 (estimator construction) and Theorem on asymptotic normality] The central asymptotic normality claim for the one-step estimator (presumably Theorem 4 or 5) rests on the score-whitening step restoring the efficient influence function after projection onto the low-rank tangent space. The manuscript should explicitly verify that the whitened score remains orthogonal to the nuisance tangent space under the stated sparsity and non-uniformity conditions; otherwise the efficiency bound may not be attained at the optimal sample-complexity rate.
  2. [Section 3 (information operator) and Assumption on sampling design] The low-rank tangent space projection is used to derive the information operator, but the manuscript must confirm that the resulting operator remains invertible on the identifiable subspace when the sampling probabilities are highly non-uniform (as is typical in LLM platforms). If the minimal eigenvalue bound depends on the unknown low-rank factors, the efficiency claim becomes conditional rather than uniform.
minor comments (3)
  1. [Section 2 (model)] Notation for the latent score tensor T* and the observed comparison tensor should be introduced with a single consistent symbol set in the model section to avoid confusion between the full tensor and its low-rank factorization.
  2. [Introduction and Section 3] The abstract states that the information operator 'does not commute with the tangent-space projection'; this should be illustrated with a small numerical example or a low-dimensional analytic counter-example in the main text so readers can see the anisotropy concretely.
  3. [Related work] References to prior tensor-completion and semiparametric efficiency literature (e.g., on pairwise ranking models) are present but could be expanded with one or two additional citations on anisotropic information operators in structured models.
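The low-dimensional counter-example requested in minor comment 2 is easy to construct (our example, not drawn from the manuscript): a diagonal anisotropic operator W and the orthogonal projection P onto a tilted one-dimensional subspace fail to commute, whereas any isotropic operator c·I commutes with every projection.

```python
import numpy as np

W = np.diag([4.0, 1.0])                  # anisotropic "information" operator
u = np.array([1.0, 1.0]) / np.sqrt(2)    # tilted one-dim "tangent space"
P = np.outer(u, u)                       # orthogonal projection onto span(u)

WP, PW = W @ P, P @ W
print(np.allclose(WP, PW))               # False: W and P do not commute

iso = 3.0 * np.eye(2)                    # isotropic case for contrast
print(np.allclose(iso @ P, P @ iso))     # True: c*I commutes with any P
```

This is why whitening cannot simply be pulled through the tangent-space projection, and why the manuscript needs a dedicated argument for the whitened score.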

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive comments on the technical details of our asymptotic results. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses
  1. Referee: [Section 4 (estimator construction) and Theorem on asymptotic normality] The central asymptotic normality claim for the one-step estimator (presumably Theorem 4 or 5) rests on the score-whitening step restoring the efficient influence function after projection onto the low-rank tangent space. The manuscript should explicitly verify that the whitened score remains orthogonal to the nuisance tangent space under the stated sparsity and non-uniformity conditions; otherwise the efficiency bound may not be attained at the optimal sample-complexity rate.

    Authors: We thank the referee for this observation. The proof of Theorem 4 establishes that the whitened score is orthogonal to the nuisance tangent space by exploiting the fact that the whitening operator is constructed to preserve the range of the low-rank projection while equalizing the local information; the argument relies on the sparsity and non-uniform sampling conditions in Assumptions 3.1 and 3.3 together with the boundedness of the latent factors. To make this verification more transparent, we will insert a dedicated lemma immediately preceding Theorem 4 that isolates the orthogonality property and summarizes the key algebraic steps from the appendix. revision: yes

  2. Referee: [Section 3 (information operator) and Assumption on sampling design] The low-rank tangent space projection is used to derive the information operator, but the manuscript must confirm that the resulting operator remains invertible on the identifiable subspace when the sampling probabilities are highly non-uniform (as is typical in LLM platforms). If the minimal eigenvalue bound depends on the unknown low-rank factors, the efficiency claim becomes conditional rather than uniform.

    Authors: Assumption 3.2 already imposes a uniform lower bound on the minimal eigenvalue of the restricted information operator that is independent of the particular low-rank factors; the bound is expressed solely in terms of the sampling probabilities and the uniform boundedness of the latent scores (Assumption 2.1). Under the non-uniform designs typical of LLM platforms, this bound remains positive and uniform over the parameter space. We will revise the wording of Assumption 3.2 and add a short remark in Section 3 that explicitly states the uniformity of the eigenvalue bound and its implications for the efficiency claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper applies standard semiparametric efficiency theory to derive the information operator on the low-rank tangent space, the efficient influence function, the semiparametric efficiency bound, and a one-step debiased estimator for the functional ψ(T★) under the stated low-rank latent tensor and BTL pairwise observation model. The score-whitening step is introduced explicitly to address the acknowledged anisotropy of the information operator. No derivation step reduces by construction to its inputs, no parameter is fitted on a subset and renamed as a prediction, and no load-bearing self-citation or imported uniqueness theorem is invoked in the provided text. The central results follow directly from the model assumptions without self-referential definitions or renaming of known empirical patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the low-rank structure of the latent score tensor and the Bradley-Terry-Luce model for observations. No free parameters are explicitly introduced or fitted beyond these structural assumptions, and no new entities are postulated. The score-whitening is a methodological adjustment derived from the information operator rather than an additional postulate.

axioms (2)
  • domain assumption The latent score tensor has low-rank structure
    Invoked to enable completion and inference from sparse pairwise observations.
  • domain assumption Pairwise comparisons follow Bradley-Terry-Luce-type models
    Used to link the observed comparisons to the underlying tensor scores.

pith-pipeline@v0.9.0 · 5511 in / 1543 out tokens · 65633 ms · 2026-05-10T19:35:50.316627+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Perturbation is All You Need for Extrapolating Language Models

    stat.ML · 2026-05 · unverdicted · novelty 6.0

    Perturbing prefixes to semantic neighbors during training creates a hierarchical noise model that improves language model predictions on token sequences outside the training corpus support.

Reference graph

Works this paper leans on

29 extracted references · 10 canonical work pages · cited by 1 Pith paper · 1 internal anchor
