pith. machine review for the scientific record. sign in

arxiv: 2605.10405 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: no theorem link

Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization

Elad Tolochinsky, Yaniv Romano, Yaniv Tenzer

Authors on Pith no claims yet

Pith reviewed 2026-05-12 05:01 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM evaluationbest model identificationlow-rank factorizationdoubly robust estimatorsmulti-armed banditsconfidence intervalsadaptive sampling
0
0 comments X

The pith

Doubly robust estimators let low-rank predictions speed up valid best-LLM identification

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to combine multi-armed bandit algorithms with low-rank factorization predictions when evaluating large language models. It derives doubly robust estimators that incorporate the predictions to cut variance while keeping the performance estimates unbiased. This matters because full evaluation of many models on many examples costs substantial compute, and the approach allows confident selection of the top model after far fewer calls. The framework specifically handles adaptive model choices that depend on prior results and sampling of examples without replacement.

Core claim

We derive doubly robust estimators of each model's performance that use the low-rank predictions to reduce variance. This enables the construction of valid finite-sample confidence intervals in our setting, where models are selected adaptively and examples are sampled without replacement.

What carries the argument

Doubly robust estimators that blend observed scores with low-rank predicted scores for variance reduction while preserving unbiasedness under adaptive sampling.

If this is right

  • Fewer model-example evaluations suffice to identify the best LLM with statistical confidence
  • Valid finite-sample confidence intervals remain available despite adaptive selection
  • Correct identification holds even if low-rank predictions contain bias
  • Real-world benchmarks show meaningful reductions in compute while selecting the top model accurately

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same doubly robust correction could apply to adaptive evaluation in other matrix-structured settings such as recommender systems
  • Savings scale with how strongly low-rank structure fits the score matrix of a given benchmark
  • One could test the method by replacing low-rank predictions with other cheap estimators and verifying interval coverage

Load-bearing premise

The doubly robust estimators remain unbiased and achieve correct coverage even when low-rank predictions are biased and under adaptive model selection with sampling without replacement.

What would settle it

Run repeated trials of the adaptive evaluation protocol on a fixed benchmark and check whether the constructed confidence intervals cover the true model performances at the nominal rate.

Figures

Figures reproduced from arXiv: 2605.10405 by Elad Tolochinsky, Yaniv Romano, Yaniv Tenzer.

Figure 1
Figure 1. Figure 1: Budget needed for 95% best￾model identification accuracy (lower is bet￾ter). Bench 1 and 2 cover 2.2K models each. ∆top2, the gap between the top two models, controls difficulty. See Section 4. 2 [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Accuracy versus budget. Shaded areas around each curve depict standard error. [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Effective Budget 21 [PITH_FULL_IMAGE:figures/full_fig_p021_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Accuracy versus budget. Shaded areas around each curve depict standard error. [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗
read the original abstract

Selecting the best large language model (LLM) for a fixed benchmark is often expensive, since exhaustive evaluation requires running every model on every example. Multi-armed bandit (MAB) algorithms can reduce the number of LLM calls by sequentially selecting the next model-example pair to evaluate, thereby avoiding wasted evaluations on clearly underperforming models. Further savings can be achieved by predicting model scores from the partially observed model-example score matrix using low-rank factorization. However, such predictions are not ground truth: they can be biased and may therefore lead to incorrect identification of the best model. In this work, we propose a principled framework that combines MAB with cheap predicted scores without compromising statistical validity. Specifically, we derive doubly robust estimators of each model's performance that use the low-rank predictions to reduce variance. This enables the construction of valid finite-sample confidence intervals in our setting, where models are selected adaptively and examples are sampled without replacement. Empirical results on real-world benchmarks show that our approach reduces the number of required evaluations, yielding meaningful savings in compute and cost while accurately identifying the best-performing model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes combining multi-armed bandit (MAB) algorithms with low-rank factorization of the partially observed model-example score matrix to reduce the number of LLM evaluations needed for best-model identification. It derives doubly robust estimators for each model's mean performance that incorporate the low-rank predictions for variance reduction, and claims these estimators yield valid finite-sample confidence intervals despite adaptive model selection and sampling of examples without replacement. Empirical results on real benchmarks are said to show meaningful reductions in evaluations while still correctly identifying the best model.

Significance. If the doubly robust estimators can be shown to remain unbiased and deliver correct finite-sample coverage under adaptive MAB selection and without-replacement sampling (even when the low-rank predictions are biased), the work would offer a practical advance for statistically valid, compute-efficient LLM benchmarking. The approach directly targets the high cost of exhaustive evaluation while addressing the risk of biased predictions leading to incorrect model selection.

major comments (2)
  1. [derivation of doubly robust estimators (abstract and main technical sections)] The central claim that the derived doubly robust estimators remain unbiased and produce valid finite-sample CIs under adaptive MAB selection plus without-replacement sampling is load-bearing, yet the manuscript provides no explicit derivation, no statement of the required assumptions on the propensity scores, and no proof that the estimator accounts for the martingale dependence (past outcomes affect future sampling probabilities). Standard DR unbiasedness does not automatically extend to this setting if propensities are treated as fixed.
  2. [confidence interval construction] The finite-population correction for sampling without replacement must be incorporated into the variance estimator and CI construction; it is unclear whether the proposed intervals include this correction or whether the coverage guarantee holds only asymptotically.
minor comments (2)
  1. [experiments] The abstract states that 'empirical savings are claimed but not quantified'; the experimental section should report concrete numbers for evaluation reduction, coverage rates, and identification accuracy across multiple benchmarks and random seeds.
  2. [preliminaries and method] Notation for the low-rank factorization, the MAB policy, and the DR estimator should be introduced with explicit definitions and distinguished from standard DR notation to avoid ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments have prompted us to strengthen the technical exposition and clarify key aspects of the finite-sample guarantees. We respond to each major comment below and indicate the revisions made.

read point-by-point responses
  1. Referee: [derivation of doubly robust estimators (abstract and main technical sections)] The central claim that the derived doubly robust estimators remain unbiased and produce valid finite-sample CIs under adaptive MAB selection plus without-replacement sampling is load-bearing, yet the manuscript provides no explicit derivation, no statement of the required assumptions on the propensity scores, and no proof that the estimator accounts for the martingale dependence (past outcomes affect future sampling probabilities). Standard DR unbiasedness does not automatically extend to this setting if propensities are treated as fixed.

    Authors: We agree that greater explicitness is warranted. While Appendix B contained a derivation, it was insufficiently cross-referenced and did not fully address the martingale structure. We have revised the manuscript by expanding the main technical section (now Section 3.2) to include the complete derivation of the doubly robust estimator. We explicitly state Assumption 1 on the propensity scores (they are known and determined by the realized history of the adaptive MAB policy) and add Lemma 1, which shows that the estimator is a martingale difference sequence with respect to the natural filtration. Unbiasedness then follows from the optional stopping theorem, extending standard DR results to this dependent setting. The low-rank predictions enter only as an auxiliary model and do not affect unbiasedness provided the propensities are correctly specified. revision: yes

  2. Referee: [confidence interval construction] The finite-population correction for sampling without replacement must be incorporated into the variance estimator and CI construction; it is unclear whether the proposed intervals include this correction or whether the coverage guarantee holds only asymptotically.

    Authors: We thank the referee for this important clarification request. The variance estimator in Equation (8) does incorporate the finite-population correction factor (N-n)/(N-1), where N is the total number of examples. Theorem 2 establishes exact finite-sample coverage (not merely asymptotic) by combining the unbiasedness of the DR estimator with the exact hypergeometric-style variance under without-replacement sampling, adjusted for the adaptive policy via the martingale property. To improve clarity we have added an explicit remark in Section 4.2 describing the correction term and its role in the coverage proof. We have also included a brief finite-sample coverage verification in the appendix. revision: partial

Circularity Check

0 steps flagged

Derivation of doubly robust estimators remains self-contained

full rationale

The paper's core contribution is the derivation of doubly robust estimators that incorporate low-rank predictions for variance reduction while preserving unbiasedness and finite-sample CI validity under adaptive MAB selection and without-replacement sampling. No quoted equations or steps reduce the claimed validity result to a tautology, a fitted parameter renamed as a prediction, or a load-bearing self-citation chain. The derivation is presented as an adaptation of standard DR theory to the specific protocol, with the low-rank component used only for efficiency rather than as a definitional input. This is the most common honest finding for a methods paper whose central claim is an estimator construction rather than a re-expression of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; standard statistical assumptions for doubly robust estimation and low-rank matrix models are implicitly required but not stated.

pith-pipeline@v0.9.0 · 5492 in / 1145 out tokens · 53676 ms · 2026-05-12T05:01:03.882959+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 6 internal anchors

  1. [1]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258, 2021

  2. [2]

    A survey on large language models for code generation.ACM Transactions on Software Engineering and Methodology, 35 (2):1–72, 2026

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation.ACM Transactions on Software Engineering and Methodology, 35 (2):1–72, 2026

  3. [3]

    Emergent Abilities of Large Language Models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models.arXiv preprint arXiv:2206.07682, 2022

  4. [4]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

  5. [5]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021

  6. [6]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In B. Kim, Y . Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y . Sun, editors,International Conference on Learning Representations, volume 2024, pages 9025–9049, 2024. URL https://proceedings.iclr.cc/paper_files/paper/ 2024/fil...

  7. [7]

    Efficient benchmarking of AI agents

    Franck Ndzomga. Efficient benchmarking of ai agents.arXiv preprint arXiv:2603.23749, 2026

  8. [8]

    Holistic agent leaderboard: The missing infrastructure for AI agent evaluation.arXiv preprint arXiv:2510.11977, 2025

    Sayash Kapoor et al. Holistic agent leaderboard: The missing infrastructure for AI agent evaluation.arXiv preprint arXiv:2510.11977, 2025

  9. [9]

    Best arm identification in multi-armed bandits

    Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In COLT-23th Conference on learning theory-2010, pages 13–p, 2010

  10. [10]

    On speeding up language model evaluation

    Jin Peng Zhou, Christian K Belardi, Ruihan Wu, Travis Zhang, Carla P Gomes, Wen Sun, and Kilian Q Weinberger. On speeding up language model evaluation. InThe Thirteenth International Conference on Learning Representations, 2025

  11. [11]

    Semiparametric efficiency in multivariate regression models with missing data.Journal of the American Statistical Association, 90(429):122–129, 1995

    James M Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data.Journal of the American Statistical Association, 90(429):122–129, 1995

  12. [12]

    Springer, 2006

    Anastasios A Tsiatis.Semiparametric theory and missing data. Springer, 2006

  13. [13]

    Prediction-powered inference.Science, 382(6671):669–674, 2023

    Anastasios N Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I Jordan, and Tijana Zrnic. Prediction-powered inference.Science, 382(6671):669–674, 2023

  14. [14]

    arXiv preprint arXiv:2311.01453 , year=

    Anastasios N Angelopoulos, John C Duchi, and Tijana Zrnic. PPI++: Efficient prediction- powered inference.arXiv preprint arXiv:2311.01453, 2023. 10

  15. [15]

    Confidence intervals for policy evaluation in adaptive experiments.Proceedings of the national academy of sciences, 118(15):e2014602118, 2021

    Vitor Hadad, David A Hirshberg, Ruohan Zhan, Stefan Wager, and Susan Athey. Confidence intervals for policy evaluation in adaptive experiments.Proceedings of the national academy of sciences, 118(15):e2014602118, 2021

  16. [16]

    Doubly robust policy evaluation and learning

    Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011

  17. [17]

    Optimal and adaptive off-policy evalua- tion in contextual bandits

    Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudık. Optimal and adaptive off-policy evalua- tion in contextual bandits. InInternational Conference on Machine Learning, pages 3589–3597. PMLR, 2017

  18. [18]

    Online multi-armed bandits with adaptive inference.Advances in Neural Information Processing Systems, 34:1939–1951, 2021

    Maria Dimakopoulou, Zhimei Ren, and Zhengyuan Zhou. Online multi-armed bandits with adaptive inference.Advances in Neural Information Processing Systems, 34:1939–1951, 2021

  19. [19]

    The adaptive doubly robust estimator and a paradox concerning logging policy.Advances in neural information processing systems, 34:1351–1364, 2021

    Masahiro Kato, Kenichiro McAlinn, and Shota Yasui. The adaptive doubly robust estimator and a paradox concerning logging policy.Advances in neural information processing systems, 34:1351–1364, 2021

  20. [20]

    Post-contextual-bandit inference.Advances in neural information processing systems, 34:28548–28559, 2021

    Aurélien Bibaut, Maria Dimakopoulou, Nathan Kallus, Antoine Chambaz, and Mark van Der Laan. Post-contextual-bandit inference.Advances in neural information processing systems, 34:28548–28559, 2021

  21. [21]

    Off-policy evaluation via adaptive weighting with data from contextual bandits

    Ruohan Zhan, Vitor Hadad, David A Hirshberg, and Susan Athey. Off-policy evaluation via adaptive weighting with data from contextual bandits. InProceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 2125–2135, 2021

  22. [22]

    Doubly-robust lasso bandit.Advances in Neural Information Processing Systems, 32, 2019

    Gi-Soo Kim and Myunghee Cho Paik. Doubly-robust lasso bandit.Advances in Neural Information Processing Systems, 32, 2019

  23. [23]

    Doubly robust thompson sampling with linear payoffs.Advances in neural information processing systems, 34:15830–15840, 2021

    Wonyoung Kim, Gi-Soo Kim, and Myunghee Cho Paik. Doubly robust thompson sampling with linear payoffs.Advances in neural information processing systems, 34:15830–15840, 2021

  24. [24]

    Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards

    Wenlong Ji, Yihan Pan, Ruihao Zhu, and Lihua Lei. Multi-armed bandits with machine learning-generated surrogate rewards.arXiv preprint arXiv:2506.16658, 2025

  25. [25]

    Best arm identification with llm judges and limited human audits.Available at SSRN 6147806, 2026

    Ruicheng Ao, Hongyu Chen, Siyang Gao, Hanwei Li, and David Simchi-Levi. Best arm identification with llm judges and limited human audits.Available at SSRN 6147806, 2026

  26. [26]

    Efficient Evaluation of LLM Performance with Statistical Guarantees

    Skyler Wu, Yash Nair, and Emmanuel J Candés. Efficient evaluation of llm performance with statistical guarantees.arXiv preprint arXiv:2601.20251, 2026

  27. [27]

    Concentration inequalities for sampling without replacement.Bernoulli, 2015

    Rémi Bardenet and Odalric-Ambrym Maillard. Concentration inequalities for sampling without replacement.Bernoulli, 2015

  28. [28]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, and Quoc V Le. H chi, denny zhou, et al. 2022. challenging big-bench tasks and whether chain-of-thought can solve them.arXiv preprint arXiv:2210.09261, 2022

  29. [29]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google- proof q&a benchmark. InFirst Conference on Language Modeling, 2024. URL https: //openreview.net/forum?id=Ti67584b98

  30. [30]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911, 2023

  31. [31]

    Camel: Communicative agents for" mind" exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023

    Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for" mind" exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023. 11

  32. [32]

    Musr: Testing the limits of chain-of-thought with multistep soft reasoning

    Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. Musr: Testing the limits of chain-of-thought with multistep soft reasoning. In B. Kim, Y . Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y . Sun, editors,In- ternational Conference on Learning Representations, volume 2024, pages 14670– 14728, 2024. URL https://proceedings.iclr.cc/pap...

  33. [33]

    Regularization paths for generalized linear models via coordinate descent.Journal of statistical software, 33:1–22, 2010

    Jerome H Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent.Journal of statistical software, 33:1–22, 2010

  34. [34]

    On bernstein-type inequalities for martingales.Stochas- tic processes and their applications, 93(1):109–117, 2001

    Kacha Dzhaparidze and JH Van Zanten. On bernstein-type inequalities for martingales.Stochas- tic processes and their applications, 93(1):109–117, 2001

  35. [35]

    Z k i − ¯Z <k i 2 Fk−1 # =E

    David A Freedman. On tail probabilities for martingales.the Annals of Probability, pages 100–118, 1975. A Proofs A.1 Proof of Lemma 1 Lemma 3.Assumeπ k i , λk i and ˆSk ij areF k−1 measurable, then for eachk≥1,E[ ˆθk i | F k−1] =µ i Proof. E[ˆθk i | F k−1] =E   1 n   X j∈O k−1 i Si,j + X j∈U k i λk i ˆSk i,j + Si,jk −λ k i ˆSk i,jk πk i !  Fk−1   ...