Recognition: 3 theorem links · Lean Theorem
Submodular Benchmark Selection
Pith reviewed 2026-05-08 19:18 UTC · model grok-4.3
The pith
Selecting small benchmark subsets by mutual information under a multivariate Gaussian model yields better imputation of large language model performance than entropy-based selection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By modeling benchmark scores across models with a multivariate Gaussian, the selection of a small informative subset becomes a submodular maximization problem, with natural objectives given by the entropy of the selected covariance and the mutual information between selected and remaining benchmarks. Both objectives are submodular; entropy selection coincides with pivoted Cholesky and admits spectral residual bounds, while mutual information, though non-monotone in general, is empirically monotone for small subsets and is optimized greedily. Experiments on three matrices from ten public leaderboards demonstrate that mutual information selection outperforms entropy selection for imputation at small subset sizes.
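The greedy procedure the claim describes can be made concrete. Below is a minimal NumPy sketch, not the authors' implementation: the paper's pivoted-Cholesky rank-one updates make entropy greedy run in O(k²N), whereas this naive version recomputes log-determinants at every step.

```python
import numpy as np

def greedy_entropy(Sigma, k):
    """Greedy log-det (entropy) selection over a covariance matrix.
    Equivalent to choosing the pivot order of a pivoted Cholesky factorization."""
    n = Sigma.shape[0]
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for v in range(n):
            if v in selected:
                continue
            S = selected + [v]
            # marginal gain in 0.5 * log det(Sigma_SS)
            _, logdet = np.linalg.slogdet(Sigma[np.ix_(S, S)])
            if selected:
                _, prev = np.linalg.slogdet(Sigma[np.ix_(selected, selected)])
            else:
                prev = 0.0
            gain = 0.5 * (logdet - prev)
            if gain > best_gain:
                best, best_gain = v, gain
        selected.append(best)
    return selected

def greedy_mi(Sigma, k):
    """Greedy mutual-information selection I(X_S; X_rest); assumes k < n
    so the remaining set is never empty."""
    n = Sigma.shape[0]
    selected = []
    for _ in range(k):
        best, best_mi = None, -np.inf
        for v in range(n):
            if v in selected:
                continue
            S = selected + [v]
            rest = [j for j in range(n) if j not in S]
            # I(S; rest) = H(S) + H(rest) - H(all); Gaussian constants cancel
            _, ld_S = np.linalg.slogdet(Sigma[np.ix_(S, S)])
            _, ld_R = np.linalg.slogdet(Sigma[np.ix_(rest, rest)])
            _, ld_all = np.linalg.slogdet(Sigma)
            mi = 0.5 * (ld_S + ld_R - ld_all)
            if mi > best_mi:
                best, best_mi = v, mi
        selected.append(best)
    return selected
```

Note that the first entropy pick is simply the benchmark with the largest marginal variance, since the initial gain is ½ log Σ_vv.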
What carries the argument
The multivariate Gaussian model of benchmark scores, from which entropy and mutual information are derived as submodular objectives for selecting informative subsets.
If this is right
- Entropy selection coincides with pivoted Cholesky and admits spectral residual bounds.
- Mutual information is empirically monotone for small subsets, allowing greedy optimization.
- Mutual information selection outperforms entropy for imputation at small subsets on leaderboard data.
Where Pith is reading between the lines
- This could lower the barrier to thorough model evaluation by minimizing the number of benchmarks needed.
- The technique may generalize to other correlated evaluation metrics beyond language models.
- Further work could explore how the selected subsets change as new models and benchmarks are added to leaderboards.
Load-bearing premise
The scores that models achieve on different benchmarks are well approximated by a multivariate Gaussian distribution with a covariance that remains stable and can be estimated from public leaderboards.
What would settle it
If experiments on additional sets of benchmark scores show that entropy selection achieves lower or equal imputation error compared to mutual information selection for small subsets, the claimed superiority would not hold.
Original abstract
Evaluating large language models across many benchmarks is expensive, yet many benchmarks are highly correlated. We formalize the selection of a small, informative subset as submodular maximization under a multivariate Gaussian model. Entropy (log-determinant covariance) and mutual information between selected and remaining benchmarks arise as natural objectives. Both are submodular; entropy selection coincides with pivoted Cholesky and has spectral residual bounds, while mutual information is non-monotone in general but empirically monotone for small subsets, so we optimize it greedily. Experiments on three matrices from ten public leaderboards show that mutual information selection outperforms entropy for imputation at small subsets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes the selection of a small informative subset of benchmarks for LLM evaluation as a submodular maximization problem under a multivariate Gaussian model of benchmark scores. It derives entropy (log-determinant of the covariance) and mutual information (between selected and unselected benchmarks) as natural objectives, proves submodularity for entropy with spectral residual bounds, notes that mutual information is non-monotone in general but empirically monotone for small subsets (allowing greedy optimization), and reports experiments on three matrices derived from ten public leaderboards showing that mutual information selection outperforms entropy-based selection for imputation at small subset sizes.
Significance. If the multivariate Gaussian model and the reported outperformance hold, the work provides a principled, computationally efficient framework for reducing the cost of comprehensive LLM evaluation by identifying redundant benchmarks. The submodular formulation and explicit connection to pivoted Cholesky for entropy selection are strengths, as is the use of real leaderboard data for validation.
major comments (3)
- [Abstract and Experiments] Abstract and Experiments section: The central empirical claim that mutual information selection outperforms entropy for imputation rests on covariance matrices estimated directly from the same leaderboard data used to measure selection quality. No held-out models, external validation set, or cross-validation procedure is described, so the reported gains are not independent of the fitting procedure and may not generalize to new models.
- [Abstract and model description] Abstract and model description: Benchmark scores are bounded in [0,1], frequently discrete or skewed, and the number of models is often modest relative to the number of benchmarks. The paper provides no details on covariance estimation, regularization (to avoid singularity), or any diagnostic checks (e.g., normality tests or Q-Q plots) for the multivariate Gaussian assumption that underlies both objectives and the imputation metric.
- [Theory and Experiments] Theory and Experiments: Mutual information is stated to be non-monotone in general but 'empirically monotone for small subsets,' yet no quantitative details, error bars, or description of the test procedure (e.g., how many subsets were checked, on which matrices) are supplied. Without this, the justification for using the standard greedy algorithm lacks rigor.
minor comments (2)
- [Model section] Notation for the covariance matrix and its estimation procedure should be introduced with an explicit equation early in the model section.
- [Experiments] The three matrices used in experiments should be described with their dimensions (number of models vs. benchmarks) and any preprocessing steps.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. The concerns regarding empirical validation, model assumptions, and theoretical rigor are well-taken and point to opportunities to strengthen the manuscript. We address each major comment below and will revise the paper accordingly.
Point-by-point responses
-
Referee: [Abstract and Experiments] Abstract and Experiments section: The central empirical claim that mutual information selection outperforms entropy for imputation rests on covariance matrices estimated directly from the same leaderboard data used to measure selection quality. No held-out models, external validation set, or cross-validation procedure is described, so the reported gains are not independent of the fitting procedure and may not generalize to new models.
Authors: We acknowledge that the current experiments estimate covariance and evaluate imputation performance on the same leaderboard data, which limits claims of generalization to entirely new models. The design was intended to demonstrate the method's behavior on real public leaderboards. In the revision we will add a cross-validation protocol that holds out a subset of models when estimating covariance and then measures imputation error on the held-out models. This will provide an out-of-sample assessment of the reported gains. revision: yes
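The promised held-out protocol could look like the following sketch. The fold count, ridge term, and all names here are illustrative assumptions, not details from the paper: the covariance is fitted on training models only, and unselected benchmark scores for held-out models are imputed by the Gaussian conditional mean.

```python
import numpy as np

def heldout_imputation_error(scores, selected, n_folds=5, seed=0):
    """Hypothetical cross-validation protocol: estimate mean and covariance
    on training models, then impute unselected benchmark scores for each
    held-out model from its scores on the selected subset."""
    rng = np.random.default_rng(seed)
    M, N = scores.shape  # models x benchmarks
    order = rng.permutation(M)
    folds = np.array_split(order, n_folds)
    rest = [j for j in range(N) if j not in selected]
    errs = []
    for fold in folds:
        train = np.setdiff1d(order, fold)
        mu = scores[train].mean(axis=0)
        # small ridge keeps the estimate invertible when models are scarce
        Sigma = np.cov(scores[train], rowvar=False) + 1e-6 * np.eye(N)
        # Gaussian conditional mean: E[rest | selected]
        S_rs = Sigma[np.ix_(rest, selected)]
        S_ss = Sigma[np.ix_(selected, selected)]
        W = S_rs @ np.linalg.inv(S_ss)
        for m in fold:
            pred = mu[rest] + W @ (scores[m, selected] - mu[selected])
            errs.append(np.mean((pred - scores[m, rest]) ** 2))
    return float(np.mean(errs))
```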
-
Referee: [Abstract and model description] Abstract and model description: Benchmark scores are bounded in [0,1], frequently discrete or skewed, and the number of models is often modest relative to the number of benchmarks. The paper provides no details on covariance estimation, regularization (to avoid singularity), or any diagnostic checks (e.g., normality tests or Q-Q plots) for the multivariate Gaussian assumption that underlies both objectives and the imputation metric.
Authors: We agree that the manuscript should supply these implementation and diagnostic details. In the revised version we will expand the model section to describe: the covariance estimator employed (sample covariance with shrinkage or ridge regularization to ensure positive-definiteness), the handling of bounded and possibly discrete scores, and any normality diagnostics or explicit limitations of the Gaussian approximation. We will also discuss the implications of having fewer models than benchmarks. revision: yes
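One such regularized estimator can be sketched briefly; the shrinkage weight and ridge term below are illustrative choices, not the paper's.

```python
import numpy as np

def shrinkage_covariance(scores, alpha=0.1, eps=1e-8):
    """Sample covariance shrunk toward a scaled identity so the estimate
    stays positive-definite even when there are fewer models than benchmarks.
    alpha and eps are illustrative, not values from the paper."""
    S = np.cov(scores, rowvar=False)
    n = S.shape[0]
    target = (np.trace(S) / n) * np.eye(n)
    return (1 - alpha) * S + alpha * target + eps * np.eye(n)
```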
-
Referee: [Theory and Experiments] Theory and Experiments: Mutual information is stated to be non-monotone in general but 'empirically monotone for small subsets,' yet no quantitative details, error bars, or description of the test procedure (e.g., how many subsets were checked, on which matrices) are supplied. Without this, the justification for using the standard greedy algorithm lacks rigor.
Authors: We will augment the experimental section with the requested quantitative support. The revision will specify the exact procedure used to assess empirical monotonicity (number of subsets sampled, the three covariance matrices tested, and the observed marginal-gain statistics), include error bars or confidence intervals where appropriate, and present the results in a table or figure. This will make the justification for greedy optimization fully reproducible and rigorous. revision: yes
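An illustrative version of such a monotonicity audit is below; the subset counts and tolerance are assumptions, not the authors' procedure. It samples random small subsets and counts cases where adding one element decreases mutual information.

```python
import numpy as np

def mi(Sigma, S):
    """I(X_S; X_rest) under a zero-mean Gaussian with covariance Sigma."""
    n = Sigma.shape[0]
    rest = [j for j in range(n) if j not in S]
    if not S or not rest:
        return 0.0
    _, ld_S = np.linalg.slogdet(Sigma[np.ix_(S, S)])
    _, ld_R = np.linalg.slogdet(Sigma[np.ix_(rest, rest)])
    _, ld_all = np.linalg.slogdet(Sigma)
    return 0.5 * (ld_S + ld_R - ld_all)

def monotonicity_violations(Sigma, k_max, n_trials=200, seed=0):
    """Count sampled cases where growing the subset by one element
    decreases I(X_S; X_rest), i.e. monotonicity violations."""
    rng = np.random.default_rng(seed)
    n = Sigma.shape[0]
    violations = 0
    for _ in range(n_trials):
        k = int(rng.integers(1, k_max))
        S = list(rng.choice(n, size=k, replace=False))
        v = int(rng.choice([j for j in range(n) if j not in S]))
        if mi(Sigma, S + [v]) < mi(Sigma, S) - 1e-12:
            violations += 1
    return violations
```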
Circularity Check
MI selection 'outperforms' entropy for imputation by directly optimizing the conditional entropy that defines the imputation metric under the fitted MVN
specific steps
-
Fitted input treated as prediction
[Abstract (experiments paragraph) and model definition sections]
"Experiments on three matrices from ten public leaderboards show that mutual information selection outperforms entropy for imputation at small subsets."
Covariance is estimated from the same leaderboard matrices used both to compute the MI/entropy objectives and to evaluate imputation error. Imputation error is the conditional entropy H(remaining | selected) under the fitted MVN; MI selection greedily maximizes I(selected; remaining) which is exactly H(remaining) - H(remaining|selected). Therefore the 'outperformance' for the imputation metric is the expected result of optimizing that metric, not an independent empirical finding.
full rationale
The paper fits a multivariate Gaussian covariance to the leaderboard score matrices, defines both selection objectives and the imputation error via that same model (entropy = log-det, MI = reduction in conditional entropy), then reports that greedy MI selection yields better imputation than entropy selection on the identical fitted matrices. Because the evaluation metric is exactly the quantity being maximized by the MI procedure (I(S; rest) = H(rest) - H(rest|S)), the reported superiority is a direct consequence of optimizing the evaluation criterion rather than an independent test on held-out data or non-Gaussian predictors. No external validation set or parameter-free derivation is described.
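The identity at the heart of this objection can be written out under the Gaussian model (these are standard formulas, restated for clarity):

```latex
% Mutual information between selected and remaining benchmarks
I(X_S; X_{\bar S}) = H(X_{\bar S}) - H(X_{\bar S} \mid X_S)

% Under the fitted multivariate Gaussian, with the Schur complement:
H(X_{\bar S} \mid X_S)
  = \tfrac{1}{2}\log\det\!\bigl(2\pi e\,
      (\Sigma_{\bar S \bar S} - \Sigma_{\bar S S}\,\Sigma_{SS}^{-1}\,\Sigma_{S \bar S})\bigr)
```

Greedily maximizing I(X_S; X_S̄) therefore drives down H(X_S̄ | X_S), which is exactly the conditional entropy used as the imputation metric, so the comparison favors MI by construction unless evaluated on held-out data.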
Axiom & Free-Parameter Ledger
free parameters (1)
- covariance matrix
axioms (1)
- domain assumption: Benchmark scores across models follow a multivariate Gaussian distribution
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel (canonical RS cost J(x) = ½(x + x⁻¹) − 1 is unrelated to the log-det Gaussian entropy used here) · unclear
"We model scores as draws from a multivariate Gaussian and select a subset A of benchmarks to maximize either the entropy H(X_A) or the mutual information I(X_A; X_Ā). Both objectives are submodular."
-
IndisputableMonolith.Foundation.Atomicity · atomic_tick / sequential_preserves_conservation (RS's combinatorial scheduling/ordering machinery is structurally different from greedy submodular maximization) · unclear
"Entropy greedy is pivoted Cholesky, runs in O(k²N) time... MI is non-monotone in general but empirically monotone for small k, so we use greedy as a heuristic rather than invoking the standard monotone-submodular guarantee."
-
IndisputableMonolith.Cost · Jcost / Jcost_unit0 (RS cost is scale-fixed with unique zero at x = 1; the paper's entropy is invariant to modular shifts, so there is no overlap with J's rigidity) · unclear
"f1(S) = H(X_S) = ½ log det(2πe Σ_SS) ... maximizing log det(Σ_AA), since additive constants do not affect the argmax."
Forward citations
Cited by 1 Pith paper
-
ProactBench: Beyond What The User Asked For
ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
Reference graph
Works this paper leans on
-
[1]
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
-
[2]
Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Jeffrey Hsu, Mimansa Jaiswal, Wilson Y Lee, Haonan Li, Ch... GitHub repository; includes the BenchPress method.
-
[3]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavar...
-
[4]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.
-
[5]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
-
[6]
TabImpute: Universal Zero-Shot Imputation for Tabular Data
Jacob Feitelberg, Dwaipayan Saha, Kyuseong Choi, Zaid Ahmad, Anish Agarwal, and Raaz Dwivedi. TabImpute: Universal zero-shot imputation for tabular data. arXiv preprint arXiv:2510.02625.
-
[7]
RULER: What's the Real Context Size of Your Long-Context Language Models?
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021a. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematic...
-
[8]
Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline. arXiv preprint arXiv:2406.11939,
-
[9]
Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. WildBench: Benchmarking LLMs with challenging tasks from real users in the wild. arXiv preprint arXiv:2406.04770.
-
[10]
MTEB: Massive text embedding benchmark
Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037.
2023
-
[11]
Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jessica Thompson, Phu Mon Htut, and Samuel R Bowman
Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jessica Thompson, Phu Mon Htut, and Samuel R Bowman. BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022. Hugging Face collection.
2022
-
[12]
Do these LLM benchmarks agree? fixing benchmark evaluation with BenchBench
Yotam Perlitz, Ariel Gera, Ofir Arviv, Asaf Yehudai, Elron Bandel, Eyal Shnarch, Michal Shmueli-Scheuer, and Leshem Choshen. Do these LLM benchmarks agree? fixing benchmark evaluation with BenchBench. arXiv preprint arXiv:2407.13696,
-
[13]
Instruction-Following Evaluation for Large Language Models
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911.
-
[14]
IFEVAL, BIG-BENCH HARD, MATH LEVEL 5, GPQA, MUSR, and MMLU-PRO
replaced them with 6 harder alternatives: IFEVAL Zhou et al. [2023], BIG-BENCH HARD, MATH LEVEL 5, GPQA Rein et al. [2024], MUSR Sprague et al. [2024], and MMLU-PRO Wang et al. [2024]. EleutherAI's Language Model Evaluation Harness Biderman et al.
2023
-
[15]
propose a modified greedy algorithm that runs two strategies in parallel. The first is a cost-effective greedy: at each step, select the affordable element with the largest ratio of marginal gain to cost, Δ(v|S)/c(v), and continue until the budget is exhausted. The second is a best-singleton strategy: simply pick the single element with the highest f-value...
2005
-
[16]
B.3 Budget Constraint for Entropy: when benchmarks have non-uniform costs c_j > 0, the cardinality constraint is replaced by ∑_{j∈A} c_j ≤ C
suffices. B.3 Budget Constraint for Entropy. When benchmarks have non-uniform costs c_j > 0, the cardinality constraint is replaced by ∑_{j∈A} c_j ≤ C. Following Krause and Guestrin [2005], two strategies are run in parallel: a cost-effective greedy that selects the affordable element with the largest gain-to-cost ratio Δ_c(j|A)/c_j until the budget is exhaus...
2005
-
[17]
Since the greedy algorithm grows S one element at a time, we can maintain a Cholesky factorization of Σ_SS via rank-one updates at cost O(k²n) overall
Both conditional variances have closed forms via the Schur complement: σ²_{v|S} = Σ_vv − Σ_vS Σ_SS⁻¹ Σ_Sv (11), σ²_{v|S̄_v} = Σ_vv − Σ_{v S̄_v} Σ_{S̄_v S̄_v}⁻¹ Σ_{S̄_v v} (12). For the entropy objective f1, only (11) is needed. Since the greedy algorithm grows S one element at a time, we can maintain a Cholesky factorization of Σ_SS via rank-one updates at cost O(k²n) overall. F...
2013
-
[18]
Lazy evaluation applies to the selection criterion but not to the refactorization, which must be performed eagerly
rank-one downdate (13), but for N in the hundreds and k ≤ 15 the wall-clock overhead is negligible (under one second on all our datasets). Lazy evaluation applies to the selection criterion but not to the refactorization, which must be performed eagerly. D Dataset Details. We assemble score matrices from ten public leaderboards spanning general language unde...
2024
-
[19]
E Model Canonicalization To merge score matrices from different benchmark collections into a unified table, we must identify when the same model appears under different names
Auxiliary count, uncertainty, and length columns are excluded so that all columns represent benchmark scores or win rates. E Model Canonicalization To merge score matrices from different benchmark collections into a unified table, we must identify when the same model appears under different names. Table 2 lists the 118 models that we identified across at ...
2024
-
[20]
MassiveIntentClassification is the most stable top pick (10/10, mean rank 2.9), followed by STS15 and STS17
[Figure: selection position k = 1–15 vs. benchmark chosen; labels include MassiveIntentClassification, STS15, STS17, AmazonPolarityClassification, BiorxivClusteringS2S, MedrxivClusteringP2P, NQ, SciFact, SciDocsRR, ArxivClusteringS2S, RedditClustering, AskUbuntuDupQuestions, ArxivClusteringP2P, StackExchangeClustering, STS13, STS14, DBPedia, SprintDuplicateQuestions, ToxicConversationsCla...]
2014
-
[21]
... = 3,363). Implications. The formal rejection of Gaussianity is unsurprising: with M = 5,452 observations, even minor departures from normality are statistically detectable, and benchmark scores are bounded, discrete, and occasionally multimodal. The important question is whether these departures undermine the practical utility of the Gaussian imputation. The ...
2026
-
[22]
Entropy and random selection are comparable; MI remains unstable
Random. (d) BENCHPRESS. Entropy and random selection are comparable; MI remains unstable. Figure 15: TABIMPUTE V2 imputation R² on all four datasets (10-fold CV, 10% holdout). Compare with Figures 5–7 (Gaussian imputation), where R² is substantially higher across the board. Discussion. The Gaussian model's advantage is not surprising: it estimates the cov...
2026
discussion (0)