Submodular Benchmark Selection

2026 · cs.AI · arXiv 2605.02209

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

Evaluating large language models across many benchmarks is expensive, yet many benchmarks are highly correlated. We formalize the selection of a small, informative subset as submodular maximization under a multivariate Gaussian model. Entropy (log-determinant covariance) and mutual information between selected and remaining benchmarks arise as natural objectives. Both are submodular; entropy selection coincides with pivoted Cholesky and has spectral residual bounds, while mutual information is non-monotone in general but empirically monotone for small subsets, so we optimize it greedily. Experiments on three matrices from ten public leaderboards show that mutual information selection outperforms entropy for imputation at small subsets.
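The two greedy selectors described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`greedy_entropy`, `gaussian_mi`, `greedy_mi`), the synthetic score matrix, and the small ridge term are all hypothetical. Under a Gaussian model the entropy gain from adding a benchmark grows with its conditional variance given the benchmarks already chosen, which is why greedy entropy selection picks the same pivots as pivoted Cholesky. The mutual information objective is evaluated here by brute force from log-determinants, which is only practical for a small number of benchmarks.

```python
import numpy as np

def greedy_entropy(cov, k):
    """Greedy log-det (entropy) selection via k steps of pivoted Cholesky."""
    n = cov.shape[0]
    resid = np.diag(cov).astype(float).copy()  # conditional variances given selection
    L = np.zeros((n, k))                       # partial Cholesky factor
    selected = []
    for j in range(k):
        i = int(np.argmax(resid))              # pivot: largest conditional variance
        selected.append(i)
        L[:, j] = (cov[:, i] - L[:, :j] @ L[i, :j]) / np.sqrt(resid[i])
        resid = resid - L[:, j] ** 2           # Schur-complement diagonal update
        resid[selected] = -np.inf              # never re-select a chosen index
    return selected

def gaussian_mi(cov, S):
    """I(X_S; X_rest) for a zero-mean Gaussian with covariance cov."""
    n = cov.shape[0]
    S = list(S)
    R = [i for i in range(n) if i not in S]
    if not S or not R:
        return 0.0

    def logdet(idx):
        return np.linalg.slogdet(cov[np.ix_(idx, idx)])[1]

    return 0.5 * (logdet(S) + logdet(R) - logdet(list(range(n))))

def greedy_mi(cov, k):
    """Greedily add the benchmark that maximizes mutual information."""
    n = cov.shape[0]
    selected = []
    for _ in range(k):
        candidates = [i for i in range(n) if i not in selected]
        best = max(candidates, key=lambda i: gaussian_mi(cov, selected + [i]))
        selected.append(best)
    return selected

# Synthetic stand-in for a leaderboard: 50 models scored on 8 benchmarks.
rng = np.random.default_rng(0)
scores = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 8))
scores += 0.1 * rng.normal(size=(50, 8))
demo_cov = np.cov(scores, rowvar=False) + 1e-6 * np.eye(8)  # small ridge (assumption)

print("entropy selection:", greedy_entropy(demo_cov, 3))
print("MI selection:     ", greedy_mi(demo_cov, 3))
```

The entropy selector costs O(nk²) because it reuses the partial Cholesky factor, whereas this mutual information sketch recomputes log-determinants for every candidate and is intended only to make the objective concrete.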

representative citing papers

Complement Submodular Information Measures for Balanced and Robust Data Selection

cs.LG · 2026-05-23 · unverdicted · novelty 7.0

Introduces complement-aware submodular functions (CSI) that preserve structure between subset and complement for improved robust data selection.

ProactBench: Beyond What The User Asked For

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Complement Submodular Information Measures for Balanced and Robust Data Selection
cs.LG · 2026-05-23 · unverdicted · none · ref 31 · internal anchor

ProactBench: Beyond What The User Asked For
cs.LG · 2026-05-09 · unverdicted · none · ref 147 · internal anchor
