Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking
Pith reviewed 2026-05-08 05:23 UTC · model grok-4.3
The pith
A repeated-split protocol with item-level bootstrap corrects optimistic bias in adaptive LLM benchmarking.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a fixed-shortlist regime with smooth stabilized selection, the SIREN estimator admits a first-order item-level representation, and the bootstrap yields valid simultaneous inference on a finite budget grid. This supports confidence intervals for procedure-performance curves and pre-specified equal-budget and cross-budget comparisons. Controlled simulations and MMLU-Pro tuning experiments show that winner-based reporting can be optimistic and can change deployment conclusions, while SIREN remains close to the finite-sample reporting target.
What carries the argument
SIREN, the selection-aware repeated-split protocol that freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and uses an item-level Gaussian multiplier bootstrap for uncertainty quantification.
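The protocol as described admits a compact sketch: repeatedly split the evaluation items, run selection on one half against the frozen shortlist, and score the selected candidate only on the other half. The function names, the half/half split, and the plain mean aggregation below are illustrative assumptions, not the paper's implementation.

```python
import random
import statistics

def siren_point_estimate(items, shortlist, select, score, n_splits=50, seed=0):
    """Sketch of a SIREN-style repeated-split estimate for one tuning budget.

    `shortlist` is frozen before this function is called (post-search);
    `select` picks one candidate from the shortlist using only the
    selection half, and `score` evaluates a candidate on a single item.
    Aggregating by a plain mean over splits is an assumption here.
    """
    rng = random.Random(seed)
    per_split = []
    for _ in range(n_splits):
        shuffled = items[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        select_half, eval_half = shuffled[:half], shuffled[half:]
        winner = select(shortlist, select_half)      # splitwise selection
        # held-out evaluation: the winner never sees eval_half during selection
        per_split.append(statistics.mean(score(winner, it) for it in eval_half))
    return statistics.mean(per_split)
```

The key structural point the sketch makes explicit: the items that pick the winner and the items that score it are disjoint within every split, so the reported score is not contaminated by selection.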
If this is right
- Confidence intervals become available for the entire curve of procedure performance versus tuning budget.
- Pre-specified equal-budget and cross-budget comparisons between procedures gain valid simultaneous coverage.
- Winner-based scores can produce optimistic bias large enough to reverse deployment decisions.
- SIREN tracks the held-out finite-sample target closely in both synthetic and real tuning settings.
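The optimism in the third bullet is easy to reproduce in miniature: score several equally good candidates on one shared item set and report the best observed score. All numbers below are invented for illustration and are not taken from the paper's experiments.

```python
import random

def winners_curse_gap(n_candidates=20, n_items=200, true_acc=0.6,
                      n_trials=500, seed=0):
    """Average gap between the observed winner's score and the true accuracy
    when all candidates are equally good and items are reused for selection."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_trials):
        # each candidate's observed accuracy on the same shared item set
        observed = [
            sum(rng.random() < true_acc for _ in range(n_items)) / n_items
            for _ in range(n_candidates)
        ]
        gaps.append(max(observed) - true_acc)  # winner's optimistic bias
    return sum(gaps) / n_trials
```

With 20 identical candidates and 200 items, the winner's observed score overstates the true accuracy by several percentage points on average, purely from selecting the maximum of noisy estimates.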
Where Pith is reading between the lines
- Published results on adaptively tuned LLMs may systematically overstate deployable performance when winner scores are reported without split separation.
- The same repeated-split and bootstrap correction could be applied to other adaptive search settings where data items are reused across tuning steps.
- Extending the item-level representation beyond smooth stabilized selection would allow the method to cover a wider range of practical tuning procedures.
Load-bearing premise
The fixed-shortlist regime with smooth stabilized selection is required for the first-order item-level representation and for the bootstrap to yield valid simultaneous inference; if selection is unstable or the shortlist is not fixed, the theoretical guarantees may not hold.
What would settle it
A controlled simulation or MMLU-Pro-style experiment in which the bootstrap intervals fail to cover the true procedure-level performance at the claimed rate, even under fixed shortlist and smooth selection, would falsify the inference guarantee.
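A falsification experiment of this kind reduces to a coverage harness: simulate many datasets under the stated regime, form the interval each time, and count how often it contains the known truth. A minimal, method-agnostic sketch (the function names are assumptions):

```python
import random

def coverage_rate(interval_fn, truth, n_reps=200, seed=0):
    """Empirical coverage of an interval method against a known truth.

    `interval_fn(rng)` simulates one dataset under the assumed regime and
    returns a (lo, hi) confidence interval. Coverage well below the nominal
    level, under the stated assumptions, is the falsifying outcome.
    """
    rng = random.Random(seed)
    hits = sum(lo <= truth <= hi
               for lo, hi in (interval_fn(rng) for _ in range(n_reps)))
    return hits / n_reps
```

Plugging in the paper's bootstrap intervals for `interval_fn` under a fixed shortlist with smooth selection, and comparing the returned rate against the nominal level, is exactly the settling experiment described above.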
Original abstract
Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level target under explicit tuning budgets. We propose SIREN, a selection-aware repeated-split reporting protocol that freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and uses an item-level Gaussian multiplier bootstrap for uncertainty quantification. In a fixed-shortlist regime with smooth stabilized selection, the estimator admits a first-order item-level representation, and the bootstrap yields valid simultaneous inference on a finite budget grid. This supports confidence intervals for procedure-performance curves and pre-specified equal-budget and cross-budget comparisons. Controlled simulations and MMLU-Pro tuning experiments show that winner-based reporting can be optimistic and can change deployment conclusions, while SIREN remains close to the finite-sample reporting target.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that adaptive prompt/program search in LLM evaluation induces a winner's curse, so that the observed winner's score on reused items does not estimate the fresh-data performance of the full tune-then-deploy procedure. It introduces SIREN, a repeated-split protocol that freezes the post-search shortlist, separates selection from held-out evaluation, and applies an item-level Gaussian multiplier bootstrap. Under a fixed-shortlist regime with smooth stabilized selection, the estimator admits a first-order item-level representation and the bootstrap yields valid simultaneous inference on a finite budget grid; this enables confidence intervals for procedure-performance curves and pre-specified comparisons. Controlled simulations and MMLU-Pro tuning experiments are presented to show that winner-based reporting can be optimistic and alter deployment conclusions while SIREN tracks the finite-sample target.
Significance. If the theoretical guarantees and empirical behavior hold, the work is significant for LLM evaluation practice. It supplies a concrete, selection-aware inference method that can produce valid uncertainty quantification and cross-budget comparisons under explicit tuning budgets, directly addressing a source of over-optimism that affects deployment decisions on benchmarks such as MMLU-Pro. The provision of both simulation controls and real-tuning experiments strengthens the practical relevance.
major comments (2)
- [Abstract and experimental section] Abstract (lines 8-10) and theoretical development: the first-order item-level representation and valid simultaneous bootstrap inference are derived only under the fixed-shortlist regime with smooth stabilized selection. The MMLU-Pro tuning experiments are described at high level without explicit confirmation that the reported procedure freezes the shortlist before held-out evaluation and maintains selection stability across splits; therefore the empirical demonstration that SIREN remains close to the finite-sample target does not test the precise conditions under which the bootstrap coverage is proved.
- [Methods / Experiments] Methods and experimental details: the manuscript provides insufficient information on the exact bootstrap implementation (e.g., number of replicates, multiplier distribution, handling of ties), data exclusion rules, and how the repeated-split protocol is operationalized in both the simulations and the MMLU-Pro runs. These omissions prevent independent verification that the reported coverage properties are obtained under the stated regime.
minor comments (2)
- [Introduction / Notation] Notation for the procedure-level target and the item-level representation could be introduced earlier and used consistently to improve readability.
- [Discussion] The paper would benefit from a short discussion of how the fixed-shortlist assumption relates to common iterative prompt-search pipelines that continue adapting after an initial shortlist is formed.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. These have helped us identify areas where the manuscript can be strengthened for clarity, reproducibility, and alignment between theory and experiments. We provide point-by-point responses below and have revised the manuscript to address the concerns.
Point-by-point responses
Referee: [Abstract and experimental section] Abstract (lines 8-10) and theoretical development: the first-order item-level representation and valid simultaneous bootstrap inference are derived only under the fixed-shortlist regime with smooth stabilized selection. The MMLU-Pro tuning experiments are described at high level without explicit confirmation that the reported procedure freezes the shortlist before held-out evaluation and maintains selection stability across splits; therefore the empirical demonstration that SIREN remains close to the finite-sample target does not test the precise conditions under which the bootstrap coverage is proved.
Authors: We agree that the theoretical guarantees apply specifically under the fixed-shortlist regime with smooth stabilized selection, as stated in the abstract and theoretical sections. The MMLU-Pro experiments were designed and run following the SIREN protocol, which includes freezing the post-search shortlist prior to held-out evaluation and applying the same selection procedure across splits to maintain stability. However, we acknowledge that the original description was at too high a level and did not explicitly confirm these aspects. In the revised manuscript, we will add a new subsection in the experimental setup that explicitly states: (i) the shortlist is frozen after selection on each training split before any held-out evaluation occurs, and (ii) selection stability is enforced by using identical selection criteria and hyperparameters across all repeated splits. This will make clear that the reported results operate under the conditions for which coverage is proved, directly addressing the concern. revision: yes
Referee: [Methods / Experiments] Methods and experimental details: the manuscript provides insufficient information on the exact bootstrap implementation (e.g., number of replicates, multiplier distribution, handling of ties), data exclusion rules, and how the repeated-split protocol is operationalized in both the simulations and the MMLU-Pro runs. These omissions prevent independent verification that the reported coverage properties are obtained under the stated regime.
Authors: We accept this criticism; the current manuscript indeed omits several implementation specifics that are necessary for full reproducibility and verification. In the revised version, we will expand the Methods section with a dedicated implementation subsection that specifies: the number of bootstrap replicates (1000), the use of standard normal multipliers for the Gaussian multiplier bootstrap, tie-breaking via uniform random selection among tied items, data exclusion rules (no items were excluded beyond standard benchmark preprocessing), and a step-by-step operationalization of the repeated-split protocol for both the controlled simulations (including how splits are generated and selection is performed per split) and the MMLU-Pro runs (including budget grid, shortlist size, and held-out evaluation). These additions will enable independent verification that the experiments adhere to the fixed-shortlist regime with stabilized selection. revision: yes
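Given the settings stated in this response (1000 replicates, standard normal multipliers), an item-level Gaussian multiplier bootstrap band over a finite budget grid might be sketched as follows. Centering on empirical item-level loadings and calibrating by the max-|t| statistic across budgets are standard choices assumed here, not details taken from the paper.

```python
import math
import random

def multiplier_bootstrap_band(loadings, alpha=0.05, n_boot=1000, seed=0):
    """Simultaneous confidence band from item-level loadings.

    `loadings[b][i]` is item i's contribution to the budget-b estimate
    (the first-order item-level representation). Uses standard normal
    multipliers and 1000 replicates as stated in the rebuttal; the
    max-|t| calibration across the budget grid is an assumed choice.
    """
    rng = random.Random(seed)
    n = len(loadings[0])
    means = [sum(row) / n for row in loadings]
    ses = [math.sqrt(sum((x - m) ** 2 for x in row)) / n
           for row, m in zip(loadings, means)]
    maxima = []
    for _ in range(n_boot):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]  # shared multipliers
        maxima.append(max(
            abs(sum(gi * (x - m) for gi, x in zip(g, row)) / n) / se
            for row, m, se in zip(loadings, means, ses)))
    # (1 - alpha) quantile of the bootstrap max statistic
    crit = sorted(maxima)[int(math.ceil((1 - alpha) * n_boot)) - 1]
    return [(m - crit * se, m + crit * se) for m, se in zip(means, ses)]
```

Because the same multipliers are shared across all budgets in each replicate, the band inherits the cross-budget dependence of the item-level loadings, which is what makes the inference simultaneous rather than pointwise.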
Circularity Check
No significant circularity; bootstrap and first-order representation are standard tools applied after explicit split separation
Full rationale
The derivation chain rests on a conditional theoretical claim: under the explicitly stated fixed-shortlist regime with smooth stabilized selection, the estimator admits a first-order item-level representation and the item-level Gaussian multiplier bootstrap yields valid simultaneous inference. This is a direct application of standard non-parametric bootstrap theory to a pre-separated selection/evaluation split; the target (procedure-level performance) is defined independently of the estimator and is not recovered by construction from fitted parameters or prior self-citations. No self-definitional loops, fitted-input-as-prediction, uniqueness theorems imported from the same authors, or ansatz smuggling appear in the abstract or described protocol. Experiments are presented as empirical checks against the finite-sample target rather than tautological confirmation. The result is therefore self-contained against external statistical benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the Gaussian multiplier bootstrap yields valid simultaneous inference under the fixed-shortlist regime with smooth stabilized selection.
Reference graph
Works this paper leans on
- [1] Isaiah Andrews, Toru Kitagawa, and Adam McCloskey. Inference on winners. The Quarterly Journal of Economics, 139(1):305–358, 2024.
- [2] Anastasios N. Angelopoulos, Jacob Eisenstein, Jonathan Berant, Alekh Agarwal, and Adam Fisch. Cost-optimal active AI model evaluation. arXiv preprint arXiv:2506.07949, 2025.
- [3] Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
- [4] Rin Ashizawa, Yoichi Hirose, Nozomu Yoshinari, Kento Uchida, and Shinichi Shirakawa. Bandit-based prompt design strategy selection improves prompt optimizers. In Findings of the Association for Computational Linguistics: ACL 2025, pages 20799–20817. Association for Computational Linguistics, 2025.
- [5] Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao. Valid post-selection inference. The Annals of Statistics, 41(2):802–837, 2013.
- [6] Avrim Blum and Moritz Hardt. The ladder: A reliable leaderboard for machine learning competitions. In Proceedings of the 32nd International Conference on Machine Learning, pages 1006–1014, 2015.
- [7] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819, 2013.
- [8] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 117–126, 2015.
- [9] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, 2015.
- [10] Bradley Efron. Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.
- [11] Bradley Efron and Robert Tibshirani. Improvements on cross-validation: The .632+ bootstrap method. Journal of the American Statistical Association, 92(438):548–560, 1997.
- [12] William Fithian, Dennis Sun, and Jonathan Taylor. Optimal inference after model selection. arXiv preprint arXiv:1410.2597, 2014.
- [13] Riccardo Fogliato, Pratik Patil, Mathew Monfort, and Pietro Perona. A framework for efficient model evaluation through stratification, sampling, and estimation. arXiv preprint arXiv:2406.07320, 2024.
- [14] Kristina Gligorić, Tijana Zrnic, Cinoo Lee, Emmanuel Candès, and Dan Jurafsky. Can unconfident LLM annotations be used for confident conclusions? In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3514–3533. Association for Computational Linguistics, 2025.
- [15] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [16] Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. In International Conference on Learning Representations, 2024.
- [17] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023.
- [18] Jason D. Lee, Dennis L. Sun, Yuekai Sun, and Jonathan E. Taylor. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927, 2016.
- [19] Gili Lior, Eliya Habba, Shahar Levy, Avi Caciularu, and Gabriel Stanovsky. ReliableEval: A recipe for stochastic LLM evaluation via method of moments. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 11146–11153. Association for Computational Linguistics, 2025.
- [20] Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinyBenchmarks: evaluating LLMs with fewer examples. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 34303–34326. PMLR, 2024.
- [21] Moran Mizrahi, Guy Kaplan, Dan Malkin, Rotem Dror, Dafna Shahaf, and Gabriel Stanovsky. State of what art? A call for multi-prompt LLM evaluation. Transactions of the Association for Computational Linguistics, 12:933–949, 2024.
- [22] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [23] Krista Opsahl-Ong, Michael J. Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. Optimizing instructions and demonstrations for multi-stage language model programs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9340–9366. Association for Computational Linguistics, 2024.
- [24] Felipe Maia Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, and Mikhail Yurochkin. Efficient multi-prompt evaluation of LLMs. In Advances in Neural Information Processing Systems, volume 37, pages 22483–22512, 2024.
- [25] Chengshuai Shi, Kun Yang, Zihan Chen, Jundong Li, Jing Yang, and Cong Shen. Efficient prompt optimization through the lens of best arm identification. In Advances in Neural Information Processing Systems, 2024.
- [26] Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, 2020.
- [27] Xiaoying Tian and Jonathan Taylor. Selective inference with a randomized response. The Annals of Statistics, 46(2):679–710, 2018.
- [28] Rajan Vivek, Kawin Ethayarajh, Diyi Yang, and Douwe Kiela. Anchor points: Benchmarking models with much fewer examples. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1576–1601. Association for Computational Linguistics, 2024.
- [29] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In Advances in Neural Information Processing Systems, 2024.
- [30] Skyler Wu, Yash Nair, and Emmanuel J. Candès. Efficient evaluation of LLM performance with statistical guarantees. arXiv preprint arXiv:2601.20251, 2026.
- [31] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In International Conference on Learning Representations, 2024.
- [32] Guanhua Zhang, Florian E. Dorner, and Moritz Hardt. How benchmark prediction from fewer data misses the mark. arXiv preprint arXiv:2506.07673, 2025.
- [33] Xu-Xiang Zhong, Chao Yi, and Han-Jia Ye. Efficient evaluation of large language models via collaborative filtering. arXiv preprint arXiv:2504.08781, 2025.
- [34] Jin Peng Zhou, Christian K. Belardi, Ruihan Wu, Travis Zhang, Carla P. Gomes, Wen Sun, and Kilian Q. Weinberger. On speeding up language model evaluation. In International Conference on Learning Representations, 2025.
- [35] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In International Conference on Learning Representations, 2023.
- [36] Thomas P. Zollo, Todd Morrill, Zhun Deng, Jake C. Snell, Toniann Pitassi, and Richard Zemel. Prompt risk control: A rigorous framework for responsible deployment of large language models. In International Conference on Learning Representations, 2024.
- [37] Tijana Zrnic and Emmanuel Candès. Active statistical inference. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 62993–63010. PMLR, 2024.