arxiv: 2605.30315 · v1 · pith:3AGUCINCnew · submitted 2026-05-28 · 💻 cs.CL · cs.LG

Resolution Diagnostics for Paired LLM Evaluation

Pith reviewed 2026-06-29 07:13 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords LLM evaluation · paired hypothesis testing · leaderboard resolution · statistical power · MMLU-Pro · Open LLM Leaderboard · Cohen's h · correlation adjustment

The pith

Many displayed pairwise rankings on public LLM leaderboards fail to meet conventional paired-test resolution targets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats leaderboard comparisons as paired hypothesis tests and shows that a sizable fraction of reported rankings do not reach target alpha and power levels given the actual number of evaluations performed. It defines a resolution ratio that directly compares the evaluation count used to the count required for a chosen significance and power. This matters for any reader who treats leaderboard order as evidence of model superiority rather than as an uncalibrated display. The work also demonstrates that a widely used shortcut for incorporating correlation into power calculations underestimates the needed sample size by roughly a factor of two when effects are small.

Core claim

Across two public LLM leaderboards, 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9 MMLU-Pro top-10 adjacent-rank pairs remain unresolved at (alpha, 1-beta) = (0.05, 0.8). The MMLU-Pro count increases to 6 of 9 under subject-level clustering and stays between 5 and 6 of 9 in nearly all category-bootstrap resamples. The paper inverts standard level-alpha, power-(1-beta) tests to produce a per-pair resolution ratio q = N/N* and shows that the unpaired Cohen-h-plus-(1-rho) shortcut deviates from the correct N* by approximately a factor of two in the close-comparison regime. The unresolved-pair pattern persists under multiplicity correction and anytime-valid sequential testing.

What carries the argument

The argument rests on the resolution ratio q = N/N*. Here N* is the minimum sample size needed to reach a given alpha and power, obtained by inverting a paired hypothesis test, and N is the actual number of paired evaluations. A pair with q < 1 (equivalently N*/N > 1) is unresolved at the chosen targets.
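As a rough illustration of the inversion, the sketch below computes N* and q for paired binary accuracy. It assumes the standard Connor (1987) sample-size approximation for McNemar's test, which may differ in detail from the paper's Equation (6). The function names and the example inputs are hypothetical and are not taken from the paper.

```python
from math import sqrt, inf
from statistics import NormalDist


def discordant_probs(p_a, p_b, rho):
    """Discordant cell probabilities for two correlated Bernoulli outcomes.

    Returns (p10, p01): P(A correct, B wrong) and P(A wrong, B correct),
    given marginal accuracies p_a, p_b and Bernoulli correlation rho.
    """
    cov = rho * sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    p11 = p_a * p_b + cov          # both correct
    p10 = p_a - p11                # only A correct
    p01 = p_b - p11                # only B correct
    p00 = 1 - p11 - p10 - p01      # both wrong
    if min(p11, p10, p01, p00) < 0:
        raise ValueError("rho is not feasible for these marginals")
    return p10, p01


def required_n(p_a, p_b, rho, alpha=0.05, power=0.8):
    """Assumed Connor-style required number of paired items for a
    two-sided level-alpha McNemar test with the given power."""
    p10, p01 = discordant_probs(p_a, p_b, rho)
    psi = p10 + p01                # total discordance probability
    delta = p_a - p_b              # accuracy gap (equals p10 - p01)
    if delta == 0:
        return inf                 # no gap: no finite N reaches the target
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return (z_a * sqrt(psi) + z_b * sqrt(psi - delta ** 2)) ** 2 / delta ** 2


def resolution_ratio(n, p_a, p_b, rho, alpha=0.05, power=0.8):
    """q = N / N*; values below 1 mean the pair is unresolved."""
    return n / required_n(p_a, p_b, rho, alpha, power)
```

For instance, `resolution_ratio(12032, 0.65, 0.63, 0.3)` evaluates the diagnostic for a hypothetical pair with a two-point accuracy gap on a 12,032-item benchmark.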

If this is right

  • Leaderboard displays should include per-pair resolution diagnostics to indicate which rankings rest on adequate evidence.
  • Close model comparisons require substantially more evaluations than current practice supplies to reach conventional power targets.
  • The common unpaired Cohen-h adjustment, even after post-multiplication by (1-rho), systematically understates the sample size needed for paired LLM comparisons.
  • Subject-level clustering increases the number of unresolved pairs and should be accounted for in future leaderboard analyses.
  • The unresolved pattern holds after standard multiplicity corrections and under sequential testing procedures.
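The relationship between paired and unpaired sample sizes can be sketched with a first-order Gaussian approximation. The code below is only an illustration of the textbook 1/(1-rho) gain in the equal-marginal limit. It does not reproduce the paper's second-order expansion or its factor-of-two finding, and the function names are hypothetical.

```python
from math import sqrt


def unpaired_variance(p_a, p_b):
    """Per-item variance of the accuracy gap when the two models are
    treated as independent samples (unpaired Gaussian approximation)."""
    return p_a * (1 - p_a) + p_b * (1 - p_b)


def paired_variance(p_a, p_b, rho):
    """Per-item variance of the accuracy gap when both models answer the
    same items and their correctness has Bernoulli correlation rho."""
    cov = rho * sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    return unpaired_variance(p_a, p_b) - 2 * cov


def paired_gain(p_a, p_b, rho):
    """Ratio of unpaired to paired required N under a first-order
    approximation where N is proportional to variance / delta**2.

    The quantile factor and delta**2 cancel in the ratio, so only the
    variances matter. With p_a equal to p_b this is exactly 1 / (1 - rho).
    """
    return unpaired_variance(p_a, p_b) / paired_variance(p_a, p_b, rho)
```

With nearly equal marginals, `paired_gain(0.70, 0.69, 0.5)` is close to 2, matching 1/(1-rho) at rho = 0.5.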

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evaluation budgets may need to concentrate on deeper testing of near-tied models rather than spreading effort across many models.
  • The same diagnostic could be applied to other ranking systems that rely on paired human or automatic judgments.
  • If the resolution ratio becomes standard, model-release announcements would likely report evaluation counts relative to the required N* for their claimed superiority.
  • Clustering effects imply that prompt or subject diversity in evaluation sets directly affects the reliability of reported rankings.

Load-bearing premise

Leaderboard scores can be modeled as draws from a standard paired hypothesis test whose power is fully determined by an estimable correlation and effect size, without large unmodeled dependence or selection effects.

What would settle it

Re-analysis of the raw per-prompt scores underlying the leaderboards that finds every reported adjacent pair actually achieves at least 80 percent power at alpha 0.05 under the paired test.
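A re-analysis of this kind would start from two aligned vectors of per-prompt correctness. The sketch below, which assumes binary (0/1) scores and uses hypothetical function names, extracts the discordant counts, the empirical Bernoulli correlation, and an exact two-sided McNemar p-value. It is a minimal starting point, not the paper's pipeline.

```python
from math import comb, sqrt, nan


def paired_summary(a, b):
    """Summarise two aligned 0/1 score vectors.

    Returns (n10, n01, rho_hat): the number of items only model A got
    right, the number only model B got right, and the empirical
    correlation between the two correctness indicators.
    """
    if len(a) != len(b) or not a:
        raise ValueError("score vectors must be non-empty and aligned")
    n = len(a)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1) / n
    denom = sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    rho_hat = (p11 - p_a * p_b) / denom if denom > 0 else nan
    return n10, n01, rho_hat


def mcnemar_exact_p(n10, n01):
    """Exact two-sided McNemar p-value: under H0 each discordant item
    favours either model with probability 1/2 (binomial test)."""
    m = n10 + n01
    if m == 0:
        return 1.0                 # no discordant items, no evidence
    k = min(n10, n01)
    tail = sum(comb(m, i) for i in range(k + 1)) / 2 ** m
    return min(1.0, 2 * tail)
```

The estimated correlation and accuracies from `paired_summary` are the inputs a required-N calculation would need for each adjacent pair.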

Figures

Figures reproduced from arXiv: 2605.30315 by Anany Kotawala.

Figure 1: Efficiency gain of paired McNemar over the unpaired Gaussian formula of Miller (2024) across the 40 OLL v1 pairs. Open circles are the empirical ratio N*_Miller / N*_paired at each pair's ρ̂; the dashed curve is the textbook prediction 1/(1−ρ) in the equal-marginal limit. Median empirical gain is 2.15×.
Figure 2: Resolution diagnostic on 40 OLL v1 pairwise comparisons, faceted by task. Each bar gives r = N*/N = 1/q for one pair, sorted ascending within task; N* is computed at the observed (p̂_A, p̂_B, ρ̂) via Equation (6). Bars above the dashed line (r > 1) are unresolved at (α, 1−β) = (0.05, 0.8) (11/40 across all panels).
Figure 3: MMLU-Pro top-10 adjacent pairs (OLL v2, N = 12,032). Bars give r = N*/N = 1/q for each adjacent-rank pair (log scale); the dashed line marks r = 1. Pairs whose bar extends past the dashed line (r > 1) are unresolved at (α, 1−β) = (0.05, 0.8): four of nine.
Figure 4: Mixture e-process trajectories for paired Bernoulli data under H0 (δ = 0, grey) and H1 (δ = 2.4%, blue), calibrated to an ARC pair (ρ̂ = 0.64). The horizontal dashed line is the rejection threshold log(1/α) at α = 0.05. H0 trajectories stay below it; H1 trajectories cross between n ∼ 10^3 and n ∼ 5×10^3 (fixed-n McNemar-Connor N* = 2362 on this pair).
Figure 5: Bootstrap power against n on synthetic paired Bernoulli data with p_A = 0.65, δ ∈ {0.02, 0.04, 0.08}, ρ ≈ 0.3 via a latent Gaussian copula, N = 30,000. The bootstrap crosses the 0.8 target within 5% of N* from Equation (6) for all three δ.
Figure 6: Non-Bernoulli stress test: paired graded scores from Beta(4, 2) marginals. Bootstrap tracks the paired-t required N within ±6%.
Figure 7: Frontier panel on a 3,000-item MMLU-Pro subsample (N=1,350). Bars give N* for each adjacent-rank pair; the dashed line marks the actual N. Llama-4-Maverick vs. DeepSeek-V3.2 (rank 3 vs. 4) is unresolved at (0.05, 0.8).
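The mixture e-process behind Figure 4 can be sketched as follows. The code tracks the log e-value over a stream of discordant pairs and stops at the first crossing of log(1/α). It assumes a uniform mixture over a grid of alternative discordance probabilities θ from 0.01 to 0.99 excluding 0.5; that grid, the function names, and the stream encoding are illustrative assumptions, not the paper's exact configuration.

```python
from math import log, exp

# Assumed mixture grid: uniform weights over theta in {0.01, ..., 0.99} \ {0.5}.
DEFAULT_THETAS = [i / 100 for i in range(1, 100) if i != 50]


def log_e_value(b, c, thetas=DEFAULT_THETAS):
    """Log of the mixture e-value after b discordant pairs favouring A
    and c favouring B.

    Each component is the likelihood ratio theta^b (1-theta)^c / (1/2)^(b+c)
    against the null that a discordant pair favours either model equally.
    """
    terms = [b * log(t) + c * log(1 - t) - (b + c) * log(0.5) for t in thetas]
    m = max(terms)                 # log-sum-exp for numerical stability
    return m + log(sum(exp(x - m) for x in terms) / len(terms))


def first_rejection(discordant_stream, alpha=0.05, thetas=DEFAULT_THETAS):
    """Return the 1-based index (counted in discordant pairs, not all
    items) at which the e-value first reaches 1/alpha, or None.

    discordant_stream yields +1 when only A is correct and -1 when only
    B is correct; concordant items carry no evidence and are omitted.
    """
    b = c = 0
    threshold = log(1 / alpha)
    for i, d in enumerate(discordant_stream, start=1):
        if d > 0:
            b += 1
        else:
            c += 1
        if log_e_value(b, c, thetas) >= threshold:
            return i
    return None
```

Because the check can be made after every discordant pair, the stopping index is data-dependent, which is the sense in which the procedure is anytime-valid.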
Original abstract

Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9 MMLU-Pro top-10 adjacent-rank pairs are unresolved at (alpha, 1-beta) = (0.05, 0.8). The MMLU-Pro count rises to 6/9 under real subject-level clustering and stays at 5-6 out of 9 in 99.9% of category-bootstrap resamples. We frame paired LLM evaluation as a hypothesis-testing problem, invert level-alpha, power-(1-beta) tests, and report a per-pair resolution ratio q = N/N* as the primary diagnostic. A sharp small-effect expansion with an explicit second-order constant shows that the widely-used unpaired Cohen-h-plus-(1-rho) shortcut deviates from the correct N* by approximately a factor of two in the close-comparison regime, a deficit that three of five off-the-shelf calculators (Cohen 1988, G*Power, R pwr) silently inherit when the user post-multiplies their per-arm output by (1-rho). The unresolved-pair pattern remains under multiplicity correction and anytime-valid sequential testing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that many pairwise rankings on public LLM leaderboards do not achieve conventional statistical resolution under paired evaluation designs. Specifically, it reports that 11 of 40 comparisons on the Open LLM Leaderboard v1 and 4 of 9 adjacent pairs on MMLU-Pro are unresolved at (alpha=0.05, 1-beta=0.8), with the MMLU-Pro figure increasing to 6/9 under subject-level clustering. The central contribution is the resolution ratio q = N/N* obtained by inverting the paired-test power equation, along with a small-effect expansion showing that the common Cohen's h shortcut deviates from the required N by roughly a factor of two in close comparisons.

Significance. Should the empirical counts and derivations prove robust, this work would be significant for the field of LLM evaluation by providing a concrete, falsifiable diagnostic for when leaderboard differences are statistically resolvable. The explicit critique of off-the-shelf power calculators and the second-order expansion represent technical strengths that could improve evaluation standards if adopted.

major comments (3)
  1. Abstract: The counts of unresolved pairs rely on estimates of rho and per-item variances for the Open LLM Leaderboard v1 pairs, but the abstract (and by extension the methods) does not describe how these are obtained, leaving open whether shared training data or prompt overlap affects the effective variance used in N*.
  2. Power formula inversion (central derivation): The resolution ratio q is derived from observed deltas; this post-hoc approach means that pairs with small observed differences (due to noise or otherwise) are automatically labeled unresolved, which is the classic post-hoc power pathology and requires explicit discussion of its implications for the diagnostic.
  3. MMLU-Pro analysis: While subject-level clustering is applied, the manuscript should specify the exact model for clustering (e.g., which equation for the variance inflation) and how it interacts with the rho estimation to produce the increase from 4/9 to 6/9.
minor comments (1)
  1. Abstract: The phrase 'three of five off-the-shelf calculators' could be clarified by naming them explicitly in the main text for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for these constructive comments that highlight opportunities to strengthen the clarity of the manuscript. We respond to each major comment below.

Point-by-point responses
  1. Referee: Abstract: The counts of unresolved pairs rely on estimates of rho and per-item variances for the Open LLM Leaderboard v1 pairs, but the abstract (and by extension the methods) does not describe how these are obtained, leaving open whether shared training data or prompt overlap affects the effective variance used in N*.

    Authors: The rho and per-item variance estimates are computed empirically from the observed paired item-level scores on each benchmark. We will revise the abstract to reference this empirical estimation and add an explicit methods paragraph describing the computation from the leaderboard data. This will make clear that any effects from shared training data or prompt overlap are already incorporated into the observed variances used for N*. revision: yes

  2. Referee: Power formula inversion (central derivation): The resolution ratio q is derived from observed deltas; this post-hoc approach means that pairs with small observed differences (due to noise or otherwise) are automatically labeled unresolved, which is the classic post-hoc power pathology and requires explicit discussion of its implications for the diagnostic.

    Authors: The diagnostic intentionally conditions on the observed delta to assess whether a reported leaderboard difference meets the resolution target under the actual paired design and sample size; this differs from the classic pathology of using post-hoc power to reinterpret a non-significant result. Nevertheless, we will add a dedicated discussion paragraph distinguishing the two uses and addressing implications for noisy small deltas, including sensitivity checks. revision: yes

  3. Referee: MMLU-Pro analysis: While subject-level clustering is applied, the manuscript should specify the exact model for clustering (e.g., which equation for the variance inflation) and how it interacts with the rho estimation to produce the increase from 4/9 to 6/9.

    Authors: We will expand the MMLU-Pro section to state the exact variance inflation model VIF = 1 + (k-1) rho_s (with k items per subject and rho_s the intra-subject correlation estimated from the data) and show how this adjusted variance is combined with the pairwise rho before inverting the power equation. The revised text will also report the intermediate values that produce the shift from 4/9 to 6/9 unresolved pairs. revision: yes
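The variance inflation factor quoted in this response can be made concrete with a short sketch. It assumes equal cluster sizes (k items per subject) and simply multiplies an independent-items N* by the VIF, which is a common design-effect convention; how the paper actually combines the clustering adjustment with the pairwise rho is not stated here, and the function names are hypothetical.

```python
def design_effect(k, rho_s):
    """Variance inflation factor VIF = 1 + (k - 1) * rho_s for clusters
    of k items with intra-subject correlation rho_s."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1 + (k - 1) * rho_s


def clustered_required_n(n_star, k, rho_s):
    """Required N after inflating an independent-items N* by the VIF
    (assumed convention: multiply N* by the design effect)."""
    return n_star * design_effect(k, rho_s)


def effective_n(n, k, rho_s):
    """Equivalent number of independent items in a clustered sample."""
    return n / design_effect(k, rho_s)
```

Under this convention, even a small intra-subject correlation matters when subjects contain many items, since the inflation grows linearly in k. That is one way a pair resolved under independence can become unresolved under clustering.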

Circularity Check

0 steps flagged

No significant circularity; standard power inversion applied to external leaderboard data

Full rationale

The paper inverts conventional paired t-test power formulas (with explicit small-effect expansion) to obtain per-pair resolution ratio q = N/N* from observed deltas and estimated rho on public leaderboards. This is a direct application of textbook hypothesis-testing machinery to independent external scores rather than any self-definitional loop, fitted-input prediction, or self-citation chain. The reported counts (11/40, 4/9, etc.) are empirical tallies under the chosen (alpha, 1-beta) targets; no equation reduces to its own inputs by construction, and the derivation remains self-contained against external statistical references.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard paired t-test or proportion-test power formulas plus an empirical correlation rho estimated from leaderboard data; no new entities are postulated.

free parameters (2)
  • rho (pairwise correlation)
    Used in the paired sample-size formula; value is data-dependent and not derived from first principles.
  • effect size delta
    Observed score differences on the leaderboards; treated as fixed inputs to the power calculation.
axioms (1)
  • domain assumption: Paired observations follow the standard model for which the power formula N* is derived (typically normality or binomial with known variance).
    Invoked when inverting the level-alpha, power-(1-beta) test to obtain N*.

pith-pipeline@v0.9.1-grok · 5749 in / 1410 out tokens · 20519 ms · 2026-06-29T07:13:02.872288+00:00 · methodology


Reference graph

Works this paper leans on

10 extracted references · 6 canonical work pages · 1 internal anchor

  1. [2]

    ICML 2025 (Position track)

    URL https://arxiv.org/abs/2503.10694. ICML 2025 (Position track). Bean, A. M., Kearns, R. O., Romanou, A., Hafner, F. S., Mayne, H., et al. Measuring what matters: Construct validity in large language model benchmarks. arXiv preprint arXiv:2511.04703,

  2. [3]

    NeurIPS 2025 Datasets and Benchmarks Track

    URL https://arxiv.org/abs/2511.04703. NeurIPS 2025 Datasets and Benchmarks Track. Benjamini, Y. and Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300,

  3. [4]

    With little power comes great responsibility

    Card, D., Henderson, P., Khandelwal, U., Jia, R., Mahowald, K., and Jurafsky, D. With little power comes great responsibility. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9263–9274,

  4. [6]

    Safe testing

    URL https://arxiv.org/abs/2510.23191. Grünwald, P., de Heide, R., and Koolen, W. Safe testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology),

  5. [8]

    Position: AI Evaluations Should be Grounded on a Theory of Capability

    URL https://arxiv.org/abs/2509.19590. Liddell, F. D. K. Simplified exact analysis of case-referent studies: Matched pairs; dichotomous exposure.Journal of Epidemiology and Community Health, 37(1):82–84,

  6. [10]

    Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pontus Stenetorp, Sharan Narang, and Dieuwke Hupkes

    URL https://arxiv.org/abs/2406.10229. McNemar, Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157,

  7. [12]

    https://arxiv.org/abs/2411.00640

    URL https://arxiv.org/abs/2411.00640. Polo, F. M., Weber, L., Choshen, L., Sun, Y., Xu, G., and Yurochkin, M. tinyBenchmarks: Evaluating LLMs with fewer examples. In Proceedings of the 41st International Conference on Machine Learning (ICML),

  8. [13]

    reject the first n at which e_n ≥ 1/α

    under H0. With a discrete mixture prior ν over the alternative discordance probability θ ∈ (0, 1/2) ∪ (1/2, 1), the e-process is $e_n = \int \frac{\theta^{b_n} (1-\theta)^{c_n}}{(1/2)^{b_n + c_n}} \, d\nu(\theta)$ (24). [Figure 6 plot: bootstrap power against paired sample size n for Δμ̄ = 0.02, 0.04, 0.08, with the paired-t N* marked.] Figure 6. Non-Bernoulli st...

  9. [14]

    Empirically, a Beta(2,2) mixture and a discrete two-point mixture at θ∈ {0.4,0.6} both produce stopping times within ≈8% of the uniform on our calibration pairs

    require knowing the alternative density and are sensitive to misspecification. Empirically, a Beta(2,2) mixture and a discrete two-point mixture at θ ∈ {0.4, 0.6} both produce stopping times within ≈8% of the uniform on our calibration pairs. Calibration. We calibrate the mixture e-process on simulated paired Bernoulli calibrated to two ARC pairs (δ̂ = 2.4%,...

  10. [15]

    Table 6. Raw discordance and pair details for the seven OLL v1 close pairs (|δ̂| ≤ 2 pp)

    without re-running the code, and trace each rank pair to its concrete model identifiers. Table 6. Raw discordance and pair details for the seven OLL v1 close pairs (|δ̂| ≤ 2 pp). b = n_AB and c = n_BA are the discordant counts; ρ̂ is the empirical Bernoulli correlation; N* is the McNemar-Connor required N from Equation (6). The HellaSwag row reconciles the §1 1...