pith. machine review for the scientific record.

arxiv: 2605.11209 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: 2 theorem links

Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 02:53 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM reliability evaluation · rare event sampling · cross-entropy method · five-nines reliability · sample-efficient testing · saturated benchmarks · failure probability estimation · parameterized templates

The pith

A learned sampling distribution focused on failure-prone inputs estimates five-nines LLM reliability with up to 156 times fewer model calls than uniform sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Benchmarks that show near-perfect LLM accuracy mask the rare failures that matter for safety-critical uses, where even small differences in reliability produce large gaps in actual errors. Standard Monte Carlo sampling makes estimating probabilities at the 99.999 percent level impractical because it requires too many inferences. The paper observes that failures are not scattered randomly but cluster in predictable ways across parameterized input spaces. It therefore uses an iterative optimization procedure to learn a sampling distribution that concentrates queries on those risky regions. The resulting estimates distinguish models that look identical on conventional tests and make extreme reliability measurable under realistic compute limits.
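
To make the compute barrier concrete: for a true failure probability p estimated from n uniform draws, the naive Monte Carlo estimator and its relative standard error are (standard results in our notation, not quoted from the paper):

  \widehat{p} \;=\; \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}[\mathrm{fail}(x_i)], \qquad \mathrm{RSE}(\widehat{p}) \;=\; \sqrt{\frac{1-p}{n\,p}} \;\approx\; \frac{1}{\sqrt{n\,p}}, \qquad n \;\approx\; \frac{1}{p\,\varepsilon^{2}}

At the five-nines level (p = 10^{-5}), a ten percent relative error target (ε = 0.1) already demands n ≈ 10^{7} inferences per model-template pair, which is what makes uniform sampling impractical under realistic budgets.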

Core claim

LLM failures exhibit strong systematic patterns across broad parameterized input spaces, so the cross-entropy method can iteratively learn a sampling distribution concentrated on failure-prone inputs. When applied to GSM8K templates with three LLMs, the approach produces failure-rate estimates with tight confidence bounds using up to 156.22 times fewer inferences than naive uniform sampling, and it shows that models with indistinguishable accuracy on standard benchmarks can differ substantially in estimated failure rates.

What carries the argument

The cross-entropy method, which learns an adaptive sampling distribution over parameterized inputs that concentrates evaluations where the LLM is most likely to fail.
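
What this buys is the standard importance-sampling estimator; in our notation (the paper's symbols may differ), with p the uniform density over template parameters and q the CEM-learned proposal:

  \hat{\mu}_Q \;=\; \frac{1}{n}\sum_{i=1}^{n}\frac{p(x_i)}{q(x_i)}\,\mathbf{1}[\mathrm{fail}(x_i)], \qquad x_i \sim q

The estimator stays unbiased as long as q gives nonzero probability to every input that can fail; mixing the learned proposal with the uniform distribution is the standard defensive-mixture guarantee, and we read the λ = 0.1 in the figure captions as plausibly playing that role, though the paper would need to confirm it.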

If this is right

  • Models that match on standard accuracy benchmarks can still be ranked by their estimated failure rates at the five-nines level.
  • Extreme reliability becomes quantifiable for LLMs without requiring prohibitive numbers of inferences.
  • Reliability emerges as a measurable and separable dimension of model quality beyond accuracy on saturated tests.
  • The framework makes routine evaluation of 99.999 percent reliability feasible for reliability-sensitive applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same concentration strategy could be tested on other families of parameterized tasks such as code generation or instruction following to check whether the sample savings generalize.
  • If the systematic failure patterns persist across model scales, the method might let developers track reliability improvements during training with far less compute than full Monte Carlo runs.
  • The approach implies that future benchmarks should include parameterized templates by default so that tail behavior can be measured efficiently rather than relying on fixed test sets.
  • Deployment pipelines could incorporate periodic re-estimation of failure rates using the learned distributions to detect when a model's reliability degrades over time.

Load-bearing premise

LLM failures cluster strongly enough in certain regions of parameterized input spaces that an iterative optimization routine can learn an effective distribution focused on those regions.

What would settle it

Applying the method to a fresh set of parameterized templates or a new model family would settle it: if the reduction in required inferences stays below 10×, or if the estimated failure rates fail to separate models with matched accuracy, then the claimed efficiency and distinguishing power do not hold.

Figures

Figures reproduced from arXiv: 2605.11209 by Chenchen Gu, Eungyeup Kim, J. Zico Kolter, Vashisth Tiwari.

Figure 1. Each solid pink line connects estimates obtained with varying numbers of inferences. view at source ↗
Figure 2. Distribution of failures across parameter values for model–template pairs. Each bar shows the proportion of failures at each parameter value, and the dotted line marks the expected uniform error rate. view at source ↗
Figure 3. Example prompt–generation pairs that illustrate repetitive failures by LLMs. view at source ↗
Figure 4. Upper row: CI width vs. number of inferences. Each solid pink line connects estimates with varying numbers of inferences (pink dots) under importance sampling with the same number of CEM overhead inferences N; darker pink means larger N. The solid blue line shows uniform sampling. The Pareto frontier (dotted line) is the lower-left envelope across all (N, #evaluations) combinations. view at source ↗
Figure 5. Estimated failure probability μ̂_Q across templates and models at λ = 0.1 and an evaluation size of 1M, with error bars denoting the half-width of the confidence interval. For runs with zero failures, the error bars show exact binomial bounds, i.e., (0, 5.30e-06). view at source ↗
Figure 6. Parameter-wise failure histograms reveal strong concentration of errors. view at source ↗
Figure 7. Qwen2.5-Math-7B-Instruct, K = 4, λ = 0.1 (CI width and relative standard error vs. number of inferences; uniform vs. importance sampling with Pareto frontier). view at source ↗
Figure 8. Qwen2.5-Math-7B-Instruct, K = 8, λ = 0.1. view at source ↗
Figure 9. Qwen2.5-Math-7B-Instruct, K = 16, λ = 0.1. view at source ↗
Figure 10. gpt-oss-20b-low, K = 8, λ = 0.1. view at source ↗
Figure 11. gpt-oss-20b-low, … view at source ↗
Figure 12. gpt-oss-20b-low, K = 24, λ = 0.1. view at source ↗
Figure 13. Gemini 2.5 Flash Lite, K = 4, λ = 0.1. view at source ↗
Figure 14. Gemini 2.5 Flash Lite, K = 8, λ = 0.1. view at source ↗
Figure 15. Gemini 2.5 Flash Lite, K = 16, λ = 0.1. view at source ↗
Figure 16. Eight independently generated outputs from Qwen2.5-Math-7B-Instruct evaluated … view at source ↗
Figure 17. Eight independently generated outputs from Qwen2.5-Math-7B-Instruct evaluated … view at source ↗
Figure 18. Eight independently generated outputs from Qwen2.5-Math-7B-Instruct evaluated … view at source ↗
Figure 19. First four of eight independently generated outputs from Qwen2.5-Math-7B-Instruct … view at source ↗
Figure 20. Seven of eight independently generated outputs from Qwen2.5-Math-7B-Instruct … view at source ↗
Figure 21. Eight independently generated outputs from gpt-oss-20b-low evaluated on … view at source ↗
Figure 22. Eight independently generated outputs from gpt-oss-20b-low evaluated on … view at source ↗
Figure 23. Seven of eight independently generated outputs from gpt-oss-20b-low evaluated … view at source ↗
Figure 24. Eight generated outputs from gpt-oss-20b-low evaluated on Template 8. Error … view at source ↗
Figure 25. Eight generated outputs from Gemini 2.5 Flash Lite evaluated on Template 0. view at source ↗
Figure 26. Eight generated outputs from Gemini 2.5 Flash Lite evaluated on Template 2. view at source ↗
Figure 27. First five of eight independently generated outputs from Gemini 2.5 Flash Lite … view at source ↗
Original abstract

While existing benchmarks demonstrate the near-perfect performance of large language models (LLMs) on various tasks, this apparent saturation often obscures the need for rigorous evaluation of their reliability. In real-world deployment, however, achieving extremely high reliability (e.g., "five-nines" (99.999%) vs. "three-nines" (99.9%)) is fundamentally critical, as this gap results in an order-of-magnitude increase in failures, which is catastrophic in reliability-critical applications. Still, estimating such a rare failure probability with tight confidence bounds requires prohibitively large LLM inference sizes, making standard Monte Carlo evaluation infeasible under limited compute budgets. In this paper, we observe that LLM failures exhibit strong systematic patterns: across broad parameterized input spaces, a small subset of inputs disproportionately accounts for the majority of failures. Leveraging this observation, we propose to learn a sampling distribution concentrated on failure-prone inputs via the cross-entropy method (CEM). We evaluate our framework on three LLMs, Qwen2.5-Math-7B-Instruct, gpt-oss-20b-low, and Gemini 2.5 Flash Lite, across parameterized GSM8K templates and achieve up to 156.22x reduction in required inferences compared to naive uniform sampling. Our estimates reveal that models with indistinguishable accuracy on standard benchmarks can differ substantially in estimated failure rates, underscoring that reliability is a distinct and measurable axis of model quality. Our simple yet practical framework enables the evaluation of extreme reliability in LLMs, a distinct and underexplored dimension of evaluation beyond existing benchmarks, for their growing use in reliability-sensitive applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLM failures on parameterized input spaces (e.g., GSM8K templates) exhibit systematic patterns that can be exploited by the cross-entropy method (CEM) to learn an importance-sampling distribution concentrated on failure-prone inputs. This yields up to 156.22× fewer LLM inferences than uniform Monte Carlo sampling while producing tight estimates of rare failure rates (five-nines regime) and revealing reliability differences among models that are indistinguishable on standard accuracy metrics. Experiments are reported on Qwen2.5-Math-7B-Instruct, gpt-oss-20b-low, and Gemini 2.5 Flash Lite.

Significance. If the empirical speedup and reliability distinctions hold under proper validation, the work supplies a concrete, sample-efficient protocol for quantifying extreme reliability—an axis that standard saturated benchmarks cannot resolve. The observation that accuracy and failure-rate estimates diverge is a useful corrective for deployment decisions in reliability-critical settings. The approach is simple enough to be adopted if the CEM parameterization and overhead accounting are made fully reproducible.

major comments (3)
  1. [§3] CEM procedure: the parameterization of the sampling distribution over template variables is not specified (e.g., whether it is a product of independent categorical distributions, a neural density estimator, or a mixture), nor are the initial distribution, number of CEM iterations, or elite-sample fraction given. These choices directly affect both the bias of the failure-rate estimator and the claimed reduction factor.
  2. [§4] Experimental validation: no ground-truth comparison is provided on any restricted parameter subspace where exhaustive enumeration or an exact failure probability can be computed; without such a sanity check, it is impossible to separate genuine variance reduction from under- or over-estimation induced by the learned distribution.
  3. [§5] Results and reduction claim: the 156.22× figure is stated as a reduction in "required inferences," yet the manuscript does not clarify whether the cost of the initial uniform samples and the CEM optimization iterations themselves is counted in the method's inference budget. If these overhead samples are omitted, the net efficiency gain is overstated.
minor comments (2)
  1. Notation for the failure probability estimator (e.g., ˆp vs. p̂) is used inconsistently across equations and text; a single consistent symbol would improve readability.
  2. Figure captions should explicitly state the number of CEM iterations and elite fraction used for each curve so that readers can reproduce the exact experimental conditions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important aspects of reproducibility, validation, and cost accounting. We address each major comment point by point below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [§3] CEM procedure: the parameterization of the sampling distribution over template variables is not specified (e.g., whether it is a product of independent categorical distributions, a neural density estimator, or a mixture), nor are the initial distribution, number of CEM iterations, or elite-sample fraction given. These choices directly affect both the bias of the failure-rate estimator and the claimed reduction factor.

    Authors: We agree that these details are essential for reproducibility and for evaluating estimator properties. The submitted manuscript described the CEM at a conceptual level but omitted the concrete parameterization for brevity. We will revise §3 to state explicitly that the distribution is a product of independent categorical distributions (one per template variable), initialized uniformly, run for a fixed number of iterations with a standard elite fraction, and include pseudocode; an illustrative sketch of this parameterization appears after these responses. This specification preserves the unbiasedness of the importance-sampling estimator provided the support remains full, which it does. revision: yes

  2. Referee: [§4] Experimental validation: no ground-truth comparison is provided on any restricted parameter subspace where exhaustive enumeration or an exact failure probability can be computed; without such a sanity check, it is impossible to separate genuine variance reduction from under- or over-estimation induced by the learned distribution.

    Authors: We acknowledge the value of an explicit sanity check. While the full template space precludes exhaustive enumeration, a restricted low-dimensional subspace admits exact computation. We will add to §4 a new experiment on such a subspace that compares the CEM estimate against the exact failure probability obtained by enumeration, confirming that the observed reduction aligns with variance reduction rather than bias. This addition directly addresses the concern while remaining computationally tractable. revision: yes

  3. Referee: [§5] Results and reduction claim: the 156.22× figure is stated as a reduction in "required inferences," yet the manuscript does not clarify whether the cost of the initial uniform samples and the CEM optimization iterations themselves is counted in the method's inference budget. If these overhead samples are omitted, the net efficiency gain is overstated.

    Authors: The referee correctly identifies an ambiguity. The reported 156.22× factor compares the number of LLM calls needed by uniform Monte Carlo to reach a target confidence-interval width against the number of calls used in the final importance-sampling stage after the distribution has been learned. The initial uniform samples and CEM iterations constitute a one-time setup cost that is not folded into the ratio. In the revision we will (i) state this distinction explicitly in §5, (ii) report the absolute total inference count, including overhead, for both methods, and (iii) note that the overhead is amortized when the learned distribution is reused across multiple evaluation runs or models, preserving substantial net savings in practice; the accounting sketch after these responses makes the distinction explicit. revision: yes
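
To make the specification committed to in response 1 concrete, here is a minimal, illustrative sketch of CEM over a product of independent categorical distributions. The failure oracle is_failure, the dimension list n_values, and the simplified unweighted elite update are our assumptions for illustration, not the authors' code.

    import numpy as np

    def cem_categorical(is_failure, n_values, iters=10, batch=512, lam=0.1, rng=None):
        """Learn a proposal over template parameters that concentrates on failures.

        is_failure: black-box oracle mapping an integer parameter vector -> bool
                    (stands in for "run the LLM on the instantiated template and grade it")
        n_values:   number of discrete values for each template variable
        lam:        defensive-mixture weight with the uniform distribution; we assume
                    this mirrors the lambda = 0.1 in the paper's figures
        """
        rng = rng or np.random.default_rng(0)
        probs = [np.full(k, 1.0 / k) for k in n_values]  # initialized uniformly
        for _ in range(iters):
            # Mix with uniform so every input keeps nonzero probability
            # (keeps importance weights finite and the estimator unbiased).
            mix = [lam / k + (1.0 - lam) * p for k, p in zip(n_values, probs)]
            samples = np.stack(
                [rng.choice(k, size=batch, p=m) for k, m in zip(n_values, mix)], axis=1)
            fails = np.array([bool(is_failure(x)) for x in samples])
            if not fails.any():
                continue  # no failures this round; keep the current proposal
            elite = samples[fails]
            # Simplified cross-entropy update: per-variable frequencies of the
            # failing ("elite") samples. With a binary score the elite fraction
            # degenerates to "all observed failures".
            probs = [np.bincount(elite[:, j], minlength=k) / len(elite)
                     for j, k in enumerate(n_values)]
        return [lam / k + (1.0 - lam) * p for k, p in zip(n_values, probs)]

    def estimate_failure_rate(is_failure, q, n_eval=100_000, rng=None):
        """Unbiased importance-weighted estimate of the uniform failure probability."""
        rng = rng or np.random.default_rng(1)
        x = np.stack([rng.choice(len(qj), size=n_eval, p=qj) for qj in q], axis=1)
        # Weight = uniform density / proposal density, one factor per variable.
        w = np.prod(np.stack([(1.0 / len(qj)) / qj[x[:, j]]
                              for j, qj in enumerate(q)]), axis=0)
        f = np.array([float(is_failure(xi)) for xi in x])
        return float(np.mean(w * f))

Plugged into a toy oracle that fails only on a small corner of the parameter grid, the learned proposal concentrates there, and the importance-weighted mean recovers the uniform failure rate with far fewer oracle calls than uniform sampling, which is the effect the reduction factor measures.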
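
Response 3's gross-versus-net distinction is simple accounting; in our notation (not the paper's), with n_MC the uniform budget needed for a target CI width, n_IS the post-learning importance-sampling budget, n_CEM the one-time learning overhead, and R the number of evaluation runs that reuse the learned distribution:

  \text{gross} \;=\; \frac{n_{\mathrm{MC}}}{n_{\mathrm{IS}}}, \qquad \text{net} \;=\; \frac{n_{\mathrm{MC}}}{n_{\mathrm{IS}} + n_{\mathrm{CEM}}}, \qquad \text{net}_R \;=\; \frac{R\,n_{\mathrm{MC}}}{R\,n_{\mathrm{IS}} + n_{\mathrm{CEM}}} \;\xrightarrow{\,R\to\infty\,}\; \frac{n_{\mathrm{MC}}}{n_{\mathrm{IS}}}

On this reading, the reported 156.22× is the gross ratio, and the amortization argument in (iii) is the statement that net_R approaches the gross ratio as R grows.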

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's core procedure learns an importance sampling distribution via the standard cross-entropy method applied to observed LLM failures on parameterized templates, then computes an unbiased importance-weighted estimator of the rare-event failure probability. The reported 156.22× reduction is an empirical ratio of sample sizes needed to achieve equivalent variance or confidence bounds under uniform versus learned sampling; it is not obtained by algebraic substitution or by renaming a fitted parameter as a prediction. No equation reduces the target failure rate to a quantity defined in terms of itself, and no load-bearing premise rests on a self-citation whose content is itself unverified or tautological. The method remains externally falsifiable by repeating the sampling experiments on the cited models and benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about systematic failure patterns and on data-dependent parameters of the learned sampling distribution; no new physical entities are postulated.

free parameters (1)
  • CEM sampling distribution parameters
    The distribution over failure-prone inputs is iteratively fitted from observed failures during the CEM procedure.
axioms (1)
  • Domain assumption: LLM failures exhibit strong systematic patterns across broad parameterized input spaces
    This observation is stated as the key premise enabling the CEM approach.

pith-pipeline@v0.9.0 · 5612 in / 1327 out tokens · 108255 ms · 2026-05-13T02:53:59.823949+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 6 internal anchors
