Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking
Pith reviewed 2026-05-08 05:23 UTC · model grok-4.3
The pith
A repeated-split protocol with item-level bootstrap corrects optimistic bias in adaptive LLM benchmarking.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a fixed-shortlist regime with smooth stabilized selection, the SIREN estimator admits a first-order item-level representation, and the bootstrap yields valid simultaneous inference on a finite budget grid. This supports confidence intervals for procedure-performance curves and pre-specified equal-budget and cross-budget comparisons. Controlled simulations and MMLU-Pro tuning experiments show that winner-based reporting can be optimistic and can change deployment conclusions, while SIREN remains close to the finite-sample reporting target.
What carries the argument
SIREN, the selection-aware repeated-split protocol that freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and uses an item-level Gaussian multiplier bootstrap for uncertainty quantification.
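The protocol as described admits a compact sketch: repeatedly split the evaluation items, run selection on one half against the frozen shortlist, and score the selected candidate only on the other half. The function names, the half/half split, and the plain mean aggregation below are illustrative assumptions, not the paper's implementation.

```python
import random
import statistics

def siren_point_estimate(items, shortlist, select, score, n_splits=50, seed=0):
    """Sketch of a SIREN-style repeated-split estimate for one tuning budget.

    `shortlist` is frozen before this function is called (post-search);
    `select` picks one candidate from the shortlist using only the
    selection half, and `score` evaluates a candidate on a single item.
    Aggregating by a plain mean over splits is an assumption here.
    """
    rng = random.Random(seed)
    per_split = []
    for _ in range(n_splits):
        shuffled = items[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        select_half, eval_half = shuffled[:half], shuffled[half:]
        winner = select(shortlist, select_half)      # splitwise selection
        # held-out evaluation: the winner never sees eval_half during selection
        per_split.append(statistics.mean(score(winner, it) for it in eval_half))
    return statistics.mean(per_split)
```

The key structural point the sketch makes explicit: the items that pick the winner and the items that score it are disjoint within every split, so the reported score is not contaminated by selection.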
If this is right
- Confidence intervals become available for the entire curve of procedure performance versus tuning budget.
- Pre-specified equal-budget and cross-budget comparisons between procedures gain valid simultaneous coverage.
- Winner-based scores can produce optimistic bias large enough to reverse deployment decisions.
- SIREN tracks the held-out finite-sample target closely in both synthetic and real tuning settings.
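The optimism in the third bullet is easy to reproduce in miniature: score several equally good candidates on one shared item set and report the best observed score. All numbers below are invented for illustration and are not taken from the paper's experiments.

```python
import random

def winners_curse_gap(n_candidates=20, n_items=200, true_acc=0.6,
                      n_trials=500, seed=0):
    """Average gap between the observed winner's score and the true accuracy
    when all candidates are equally good and items are reused for selection."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_trials):
        # each candidate's observed accuracy on the same shared item set
        observed = [
            sum(rng.random() < true_acc for _ in range(n_items)) / n_items
            for _ in range(n_candidates)
        ]
        gaps.append(max(observed) - true_acc)  # winner's optimistic bias
    return sum(gaps) / n_trials
```

With 20 identical candidates and 200 items, the winner's observed score overstates the true accuracy by several percentage points on average, purely from selecting the maximum of noisy estimates.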
Where Pith is reading between the lines
- Published results on adaptively tuned LLMs may systematically overstate deployable performance when winner scores are reported without split separation.
- The same repeated-split and bootstrap correction could be applied to other adaptive search settings where data items are reused across tuning steps.
- Extending the item-level representation beyond smooth stabilized selection would allow the method to cover a wider range of practical tuning procedures.
Load-bearing premise
The fixed-shortlist regime with smooth stabilized selection is required for the first-order item-level representation and for the bootstrap to yield valid simultaneous inference; if selection is unstable or the shortlist is not fixed, the theoretical guarantees may not hold.
What would settle it
A controlled simulation or MMLU-Pro-style experiment in which the bootstrap intervals fail to cover the true procedure-level performance at the claimed rate, even under fixed shortlist and smooth selection, would falsify the inference guarantee.
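A falsification experiment of this kind reduces to a coverage harness: simulate many datasets under the stated regime, form the interval each time, and count how often it contains the known truth. A minimal, method-agnostic sketch (the function names are assumptions):

```python
import random

def coverage_rate(interval_fn, truth, n_reps=200, seed=0):
    """Empirical coverage of an interval method against a known truth.

    `interval_fn(rng)` simulates one dataset under the assumed regime and
    returns a (lo, hi) confidence interval. Coverage well below the nominal
    level, under the stated assumptions, is the falsifying outcome.
    """
    rng = random.Random(seed)
    hits = sum(lo <= truth <= hi
               for lo, hi in (interval_fn(rng) for _ in range(n_reps)))
    return hits / n_reps
```

Plugging in the paper's bootstrap intervals for `interval_fn` under a fixed shortlist with smooth selection, and comparing the returned rate against the nominal level, is exactly the settling experiment described above.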
Original abstract
Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level target under explicit tuning budgets. We propose SIREN, a selection-aware repeated-split reporting protocol that freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and uses an item-level Gaussian multiplier bootstrap for uncertainty quantification. In a fixed-shortlist regime with smooth stabilized selection, the estimator admits a first-order item-level representation, and the bootstrap yields valid simultaneous inference on a finite budget grid. This supports confidence intervals for procedure-performance curves and pre-specified equal-budget and cross-budget comparisons. Controlled simulations and MMLU-Pro tuning experiments show that winner-based reporting can be optimistic and can change deployment conclusions, while SIREN remains close to the finite-sample reporting target.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that adaptive prompt/program search in LLM evaluation induces a winner's curse, so that the observed winner's score on reused items does not estimate the fresh-data performance of the full tune-then-deploy procedure. It introduces SIREN, a repeated-split protocol that freezes the post-search shortlist, separates selection from held-out evaluation, and applies an item-level Gaussian multiplier bootstrap. Under a fixed-shortlist regime with smooth stabilized selection, the estimator admits a first-order item-level representation and the bootstrap yields valid simultaneous inference on a finite budget grid; this enables confidence intervals for procedure-performance curves and pre-specified comparisons. Controlled simulations and MMLU-Pro tuning experiments are presented to show that winner-based reporting can be optimistic and alter deployment conclusions while SIREN tracks the finite-sample target.
Significance. If the theoretical guarantees and empirical behavior hold, the work is significant for LLM evaluation practice. It supplies a concrete, selection-aware inference method that can produce valid uncertainty quantification and cross-budget comparisons under explicit tuning budgets, directly addressing a source of over-optimism that affects deployment decisions on benchmarks such as MMLU-Pro. The provision of both simulation controls and real-tuning experiments strengthens the practical relevance.
major comments (2)
- [Abstract and experimental section] Abstract (lines 8-10) and theoretical development: the first-order item-level representation and valid simultaneous bootstrap inference are derived only under the fixed-shortlist regime with smooth stabilized selection. The MMLU-Pro tuning experiments are described at high level without explicit confirmation that the reported procedure freezes the shortlist before held-out evaluation and maintains selection stability across splits; therefore the empirical demonstration that SIREN remains close to the finite-sample target does not test the precise conditions under which the bootstrap coverage is proved.
- [Methods / Experiments] Methods and experimental details: the manuscript provides insufficient information on the exact bootstrap implementation (e.g., number of replicates, multiplier distribution, handling of ties), data exclusion rules, and how the repeated-split protocol is operationalized in both the simulations and the MMLU-Pro runs. These omissions prevent independent verification that the reported coverage properties are obtained under the stated regime.
minor comments (2)
- [Introduction / Notation] Notation for the procedure-level target and the item-level representation could be introduced earlier and used consistently to improve readability.
- [Discussion] The paper would benefit from a short discussion of how the fixed-shortlist assumption relates to common iterative prompt-search pipelines that continue adapting after an initial shortlist is formed.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. These have helped us identify areas where the manuscript can be strengthened for clarity, reproducibility, and alignment between theory and experiments. We provide point-by-point responses below and have revised the manuscript to address the concerns.
Point-by-point responses
Referee: [Abstract and experimental section] Abstract (lines 8-10) and theoretical development: the first-order item-level representation and valid simultaneous bootstrap inference are derived only under the fixed-shortlist regime with smooth stabilized selection. The MMLU-Pro tuning experiments are described at high level without explicit confirmation that the reported procedure freezes the shortlist before held-out evaluation and maintains selection stability across splits; therefore the empirical demonstration that SIREN remains close to the finite-sample target does not test the precise conditions under which the bootstrap coverage is proved.
Authors: We agree that the theoretical guarantees apply specifically under the fixed-shortlist regime with smooth stabilized selection, as stated in the abstract and theoretical sections. The MMLU-Pro experiments were designed and run following the SIREN protocol, which includes freezing the post-search shortlist prior to held-out evaluation and applying the same selection procedure across splits to maintain stability. However, we acknowledge that the original description was at too high a level and did not explicitly confirm these aspects. In the revised manuscript, we will add a new subsection in the experimental setup that explicitly states: (i) the shortlist is frozen after selection on each training split before any held-out evaluation occurs, and (ii) selection stability is enforced by using identical selection criteria and hyperparameters across all repeated splits. This will make clear that the reported results operate under the conditions for which coverage is proved, directly addressing the concern. revision: yes
Referee: [Methods / Experiments] Methods and experimental details: the manuscript provides insufficient information on the exact bootstrap implementation (e.g., number of replicates, multiplier distribution, handling of ties), data exclusion rules, and how the repeated-split protocol is operationalized in both the simulations and the MMLU-Pro runs. These omissions prevent independent verification that the reported coverage properties are obtained under the stated regime.
Authors: We accept this criticism; the current manuscript indeed omits several implementation specifics that are necessary for full reproducibility and verification. In the revised version, we will expand the Methods section with a dedicated implementation subsection that specifies: the number of bootstrap replicates (1000), the use of standard normal multipliers for the Gaussian multiplier bootstrap, tie-breaking via uniform random selection among tied items, data exclusion rules (no items were excluded beyond standard benchmark preprocessing), and a step-by-step operationalization of the repeated-split protocol for both the controlled simulations (including how splits are generated and selection is performed per split) and the MMLU-Pro runs (including budget grid, shortlist size, and held-out evaluation). These additions will enable independent verification that the experiments adhere to the fixed-shortlist regime with stabilized selection. revision: yes
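Given the settings stated in this response (1000 replicates, standard normal multipliers), an item-level Gaussian multiplier bootstrap band over a finite budget grid might be sketched as follows. Centering on empirical item-level loadings and calibrating by the max-|t| statistic across budgets are standard choices assumed here, not details taken from the paper.

```python
import math
import random

def multiplier_bootstrap_band(loadings, alpha=0.05, n_boot=1000, seed=0):
    """Simultaneous confidence band from item-level loadings.

    `loadings[b][i]` is item i's contribution to the budget-b estimate
    (the first-order item-level representation). Uses standard normal
    multipliers and 1000 replicates as stated in the rebuttal; the
    max-|t| calibration across the budget grid is an assumed choice.
    """
    rng = random.Random(seed)
    n = len(loadings[0])
    means = [sum(row) / n for row in loadings]
    ses = [math.sqrt(sum((x - m) ** 2 for x in row)) / n
           for row, m in zip(loadings, means)]
    maxima = []
    for _ in range(n_boot):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]  # shared multipliers
        maxima.append(max(
            abs(sum(gi * (x - m) for gi, x in zip(g, row)) / n) / se
            for row, m, se in zip(loadings, means, ses)))
    # (1 - alpha) quantile of the bootstrap max statistic
    crit = sorted(maxima)[int(math.ceil((1 - alpha) * n_boot)) - 1]
    return [(m - crit * se, m + crit * se) for m, se in zip(means, ses)]
```

Because the same multipliers are shared across all budgets in each replicate, the band inherits the cross-budget dependence of the item-level loadings, which is what makes the inference simultaneous rather than pointwise.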
Circularity Check
No significant circularity; bootstrap and first-order representation are standard tools applied after explicit split separation
Full rationale
The derivation chain rests on a conditional theoretical claim: under the explicitly stated fixed-shortlist regime with smooth stabilized selection, the estimator admits a first-order item-level representation and the item-level Gaussian multiplier bootstrap yields valid simultaneous inference. This is a direct application of standard non-parametric bootstrap theory to a pre-separated selection/evaluation split; the target (procedure-level performance) is defined independently of the estimator and is not recovered by construction from fitted parameters or prior self-citations. No self-definitional loops, fitted-input-as-prediction, uniqueness theorems imported from the same authors, or ansatz smuggling appear in the abstract or described protocol. Experiments are presented as empirical checks against the finite-sample target rather than tautological confirmation. The result is therefore self-contained against external statistical benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the Gaussian multiplier bootstrap yields valid simultaneous inference under the fixed-shortlist regime with smooth stabilized selection.
Reference graph
Works this paper leans on
- [1] Isaiah Andrews, Toru Kitagawa, and Adam McCloskey. Inference on winners. The Quarterly Journal of Economics, 139(1):305–358, 2024.
- [2] Anastasios N. Angelopoulos, Jacob Eisenstein, Jonathan Berant, Alekh Agarwal, and Adam Fisch. Cost-optimal active AI model evaluation. arXiv preprint arXiv:2506.07949, 2025.
- [3] Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
- [4] Rin Ashizawa, Yoichi Hirose, Nozomu Yoshinari, Kento Uchida, and Shinichi Shirakawa. Bandit-based prompt design strategy selection improves prompt optimizers. In Findings of the Association for Computational Linguistics: ACL 2025, pages 20799–20817. Association for Computational Linguistics, 2025.
- [5] Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao. Valid post-selection inference. The Annals of Statistics, 41(2):802–837, 2013.
- [6] Avrim Blum and Moritz Hardt. The ladder: A reliable leaderboard for machine learning competitions. In Proceedings of the 32nd International Conference on Machine Learning, pages 1006–1014, 2015.
- [7] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819, 2013.
- [8] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 117–126, 2015.
- [9] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, 2015.
- [10] Bradley Efron. Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.
- [11] Bradley Efron and Robert Tibshirani. Improvements on cross-validation: The .632+ bootstrap method. Journal of the American Statistical Association, 92(438):548–560, 1997.
- [12] William Fithian, Dennis Sun, and Jonathan Taylor. Optimal inference after model selection. arXiv preprint arXiv:1410.2597, 2014.
- [13] Riccardo Fogliato, Pratik Patil, Mathew Monfort, and Pietro Perona. A framework for efficient model evaluation through stratification, sampling, and estimation. arXiv preprint arXiv:2406.07320, 2024.
- [14] Kristina Gligorić, Tijana Zrnic, Cinoo Lee, Emmanuel Candès, and Dan Jurafsky. Can unconfident LLM annotations be used for confident conclusions? In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3514–3533. Association for Computational Linguistics, 2025.
- [15] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [16] Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. In International Conference on Learning Representations, 2024.
- [17] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023.
- [18] Jason D. Lee, Dennis L. Sun, Yuekai Sun, and Jonathan E. Taylor. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927, 2016.
- [19] Gili Lior, Eliya Habba, Shahar Levy, Avi Caciularu, and Gabriel Stanovsky. ReliableEval: A recipe for stochastic LLM evaluation via method of moments. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 11146–11153. Association for Computational Linguistics, 2025.
- [20] Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinyBenchmarks: evaluating LLMs with fewer examples. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 34303–34326. PMLR, 2024.
- [21] Moran Mizrahi, Guy Kaplan, Dan Malkin, Rotem Dror, Dafna Shahaf, and Gabriel Stanovsky. State of what art? A call for multi-prompt LLM evaluation. Transactions of the Association for Computational Linguistics, 12:933–949, 2024.
- [22] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [23] Krista Opsahl-Ong, Michael J. Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. Optimizing instructions and demonstrations for multi-stage language model programs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9340–9366. Association for Computational Linguistics, 2024.
- [24] Felipe Maia Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, and Mikhail Yurochkin. Efficient multi-prompt evaluation of LLMs. In Advances in Neural Information Processing Systems, volume 37, pages 22483–22512, 2024.
- [25] Chengshuai Shi, Kun Yang, Zihan Chen, Jundong Li, Jing Yang, and Cong Shen. Efficient prompt optimization through the lens of best arm identification. In Advances in Neural Information Processing Systems, 2024.
- [26] Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, 2020.
- [27] Xiaoying Tian and Jonathan Taylor. Selective inference with a randomized response. The Annals of Statistics, 46(2):679–710, 2018.
- [28] Rajan Vivek, Kawin Ethayarajh, Diyi Yang, and Douwe Kiela. Anchor points: Benchmarking models with much fewer examples. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1576–1601. Association for Computational Linguistics, 2024.
- [29] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In Advances in Neural Information Processing Systems, 2024.
- [30] Skyler Wu, Yash Nair, and Emmanuel J. Candès. Efficient evaluation of LLM performance with statistical guarantees. arXiv preprint arXiv:2601.20251, 2026.
- [31] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In International Conference on Learning Representations, 2024.
- [32] Guanhua Zhang, Florian E. Dorner, and Moritz Hardt. How benchmark prediction from fewer data misses the mark. arXiv preprint arXiv:2506.07673, 2025.
- [33] Xu-Xiang Zhong, Chao Yi, and Han-Jia Ye. Efficient evaluation of large language models via collaborative filtering. arXiv preprint arXiv:2504.08781, 2025.
- [34] Jin Peng Zhou, Christian K. Belardi, Ruihan Wu, Travis Zhang, Carla P. Gomes, Wen Sun, and Kilian Q. Weinberger. On speeding up language model evaluation. In International Conference on Learning Representations, 2025.
- [35] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In International Conference on Learning Representations, 2023.
- [36] Thomas P. Zollo, Todd Morrill, Zhun Deng, Jake C. Snell, Toniann Pitassi, and Richard Zemel. Prompt risk control: A rigorous framework for responsible deployment of large language models. In International Conference on Learning Representations, 2024.
- [37] Tijana Zrnic and Emmanuel Candès. Active statistical inference. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 62993–63010. PMLR, 2024.