
arxiv: 2605.25773 · v2 · pith:5HRNFCI6new · submitted 2026-05-25 · 📊 stat.ML · cs.AI · cs.CL · cs.LG

Efficient Benchmarking Is Just Feature Selection and Multiple Regression

Pith reviewed 2026-06-29 20:18 UTC · model grok-4.3

classification 📊 stat.ML · cs.AI · cs.CL · cs.LG
keywords efficient benchmarking · feature selection · kernel ridge regression · mRMR · LLM evaluation · multiple regression · prediction error · ranking correlation

The pith

Reframing efficient benchmarking as feature selection plus kernel ridge regression yields lower prediction errors and stronger rank correlations for LLM scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that predicting full benchmark scores from a small subset of questions works best when the task is treated as multiple regression. Kernel ridge regression for the prediction step already improves on existing efficient benchmarking techniques, and pairing it with the mRMR algorithm for choosing the questions improves results further. These gains appear in both absolute error measures and in how well the predicted rankings match the true ones, and they hold across binary and continuous metrics on several benchmarks. The new combination is also faster to run and selects more stable question subsets than prior approaches. A reader would care because cheaper, more reliable ways to estimate model performance matter when full evaluation is expensive.

Core claim

By casting efficient benchmarking as multiple regression with feature selection, kernel ridge regression at the prediction stage combined with mRMR for selecting question subsets consistently produces smaller errors in mean absolute and root mean squared terms, and stronger Spearman and Kendall rank correlations between predicted and actual scores than prior methods, while being computationally faster and more stable across different data splits.
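One way to write the regression framing down is sketched below. The notation is an assumption made for illustration and is not taken from the paper: M is the number of source models (as in the figure captions), N the number of questions, C the selected question subset, κ a kernel function and λ a ridge penalty. The full-benchmark score is assumed here to be the mean of the question-level scores.

```latex
% Assumed notation, for illustration only (requires amsmath).
\begin{align*}
Y &\in \mathbb{R}^{M \times N}, \qquad Y_{mn} = \text{score of model } m \text{ on question } n, \\
s_m &= \frac{1}{N} \sum_{n=1}^{N} Y_{mn} \qquad \text{(regression target: full-benchmark score)}, \\
C &\subset \{1, \dots, N\}, \qquad |C| \ll N \qquad \text{(feature selection: chosen questions)}, \\
K_{ij} &= \kappa\bigl(Y_{i,C},\, Y_{j,C}\bigr), \qquad k(x)_i = \kappa\bigl(x,\, Y_{i,C}\bigr), \\
\hat{s}(x) &= k(x)^{\top} \bigl(K + \lambda I_M\bigr)^{-1} s \qquad \text{(kernel ridge prediction for a new model with subset scores } x\text{)}.
\end{align*}
```

Under this reading, the questions in C play the role of selected features, and a new model only needs to be evaluated on those questions to obtain x.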

What carries the argument

minimum redundancy maximum relevance (mRMR) feature selection paired with kernel ridge regression for predicting full benchmark scores from a chosen subset of questions
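The pairing described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (their tutorial code is linked in the abstract). Several choices are assumptions made for illustration: the target is discretised into quantile bins so that a plain discrete mutual-information estimate can be used, the quotient (MIQ-style) form of the mRMR score is used with a small `eps`, the kernel is an RBF with fixed `gamma`, the target is centred before fitting, and the data are synthetic. All function names are hypothetical.

```python
import numpy as np

def discrete_mi(a, b):
    """Mutual information (in nats) between two discrete 1-D arrays."""
    a_vals, a_idx = np.unique(a, return_inverse=True)
    b_vals, b_idx = np.unique(b, return_inverse=True)
    joint = np.zeros((a_vals.size, b_vals.size))
    np.add.at(joint, (a_idx, b_idx), 1.0)      # contingency counts
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])))

def mrmr_select(X, y, k, n_bins=4, eps=1e-9):
    """Greedy mRMR, quotient form: choose k columns of the binary matrix X."""
    # Assumption: bin the continuous target so discrete MI applies.
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    y_binned = np.digitize(y, edges)
    n_features = X.shape[1]
    relevance = np.array([discrete_mi(X[:, j], y_binned) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]
    redundancy_sum = np.zeros(n_features)
    while len(selected) < k:
        last = selected[-1]
        redundancy_sum += np.array(
            [discrete_mi(X[:, j], X[:, last]) for j in range(n_features)]
        )
        # Relevance divided by mean redundancy with the already-selected set.
        score = relevance / (redundancy_sum / len(selected) + eps)
        score[selected] = -np.inf               # never pick a question twice
        selected.append(int(np.argmax(score)))
    return selected

def rbf_kernel(A, B, gamma):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def krr_fit_predict(X_train, y_train, X_test, lam=1e-2, gamma=0.05):
    """Closed-form kernel ridge regression on a centred target."""
    K = rbf_kernel(X_train, X_train, gamma)
    y_mean = y_train.mean()
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train - y_mean)
    return rbf_kernel(X_test, X_train, gamma) @ alpha + y_mean

# Synthetic stand-in data: 60 models, 200 binary questions.
rng = np.random.default_rng(0)
ability = rng.normal(size=60)
difficulty = rng.normal(size=200)
prob = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
Y = (rng.random(prob.shape) < prob).astype(int)
scores = Y.mean(axis=1)                         # full-benchmark score per model

train, test = np.arange(40), np.arange(40, 60)
coreset = mrmr_select(Y[train], scores[train], k=20)
pred = krr_fit_predict(
    Y[train][:, coreset].astype(float), scores[train],
    Y[test][:, coreset].astype(float),
)
print("MAE on held-out models:", np.abs(pred - scores[test]).mean())
```

Selection and fitting both use only the training models; the held-out models contribute nothing beyond their answers on the selected questions.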

Load-bearing premise

That performance on a small, selected subset of benchmark questions is sufficiently predictive of performance on the full set via a kernel ridge regression model, without the subset choice or model introducing systematic bias across models or benchmarks.

What would settle it

Running the mRMR-plus-kernel-ridge method on a fresh benchmark and model collection and finding that its MAE, RMSE, Spearman ρ, or Kendall τ values are worse than those from existing efficient benchmarking techniques.
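The four quantities named in this test can be computed as in the sketch below. It uses NumPy only and assumes there are no tied scores: Spearman's coefficient is taken as the Pearson correlation of ranks and Kendall's coefficient is the tau-a variant. With ties, `scipy.stats.spearmanr` and `scipy.stats.kendalltau` would be the more appropriate choice. The function names and the example numbers are hypothetical.

```python
import numpy as np

def ranks(x):
    """Ranks 0..n-1 of a 1-D array; assumes no ties."""
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r

def evaluation_metrics(true, pred):
    """MAE, RMSE, Spearman rho and Kendall tau-a between true and predicted scores."""
    true = np.asarray(true, dtype=float)
    pred = np.asarray(pred, dtype=float)
    err = pred - true
    mae = np.abs(err).mean()
    rmse = np.sqrt((err**2).mean())
    rho = np.corrcoef(ranks(true), ranks(pred))[0, 1]
    # Sum of sign products over ordered pairs equals 2 * (concordant - discordant).
    dt = np.sign(true[:, None] - true[None, :])
    dp = np.sign(pred[:, None] - pred[None, :])
    n = len(true)
    tau = (dt * dp).sum() / (n * (n - 1))
    return {"MAE": mae, "RMSE": rmse, "spearman": rho, "kendall": tau}

# Hypothetical scores for four models.
print(evaluation_metrics([0.2, 0.5, 0.7, 0.9], [0.25, 0.45, 0.8, 0.75]))
```

Computing the same dictionary for each competing method on the same held-out models gives the comparison the test calls for.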

Figures

Figures reproduced from arXiv: 2605.25773 by Acyr Locatelli, Kris Cao, Sam Bowyer.

Figure 1: Our method (mRMR+/++) leads to significantly better benchmark score predictions in …
Figure 2: (a) With M = 30, mRMR++ consistently achieves lower predictive errors and higher ranking correlations than other methods at a variety of coreset sizes. (b) Whilst mRMR methods struggle with very small M, for M ∈ {30, 50}, mRMR++ dominates in both errors and ranking correlations using 10% coresets. (For clarity, we omit d = 1 methods here except for mRMR to show the relative size of underperformance.) Refit…
Figure 3: (a) Question difficulty distribution (proportion of all models which fail a given question) over coresets (coloured) vs full benchmarks (grey). (MMLU, other datasets are shown in § K.) (b) mRMR achieves greater coreset stability across random seeds, M and coreset size compared to all other competitive methods. (c) mRMR is significantly faster than all others except random sampling. We ran each method with …
Figure 4: (a; M = 32, see § H for other M values) Continuous non-pass@k benchmarks. (b; M = 15) Constructing coresets with pass@1 and predicting on all k ∈ {1, 2, 4, …, 64}.
Figure 5: True vs. predicted summary scores on ARC-C for source (train) models and test models for …
Figure 6: Performance gains from polynomial ridge regression become less pronounced as the …
Figure 7: Same setting as Fig. 6 but with relative-sized coresets: …
Figure 8: Binary-dataset results on 10% coresets using …
Figure 9: Ablations on mRMR schemes and MI-estimator nearest-neighbour hyperparameter …
Figure 10: Ablations on mRMR schemes and MI-estimator nearest-neighbour hyperparameter …
Figure 11: Ablations on mRMR schemes and MI-estimator nearest-neighbour hyperparameter …
Figure 12: Ablations on binary experiments over the form of prediction (ridge regression, polynomial …
Figure 13: The relevance, redundancy, difference (relevance …
Figure 14: Our method maximises the MIQ mRMR objective (Eq. (6)) better than other methods, …
Figure 15: Feature selection metrics during coreset construction for different mRMR variants.
Figure 16: Ablations on IRT dimensionality p ∈ {1, 5, 10} for gp-IRT on the binary experiments in § 3.1. [Panels: (a) RMSE (%), MAE (%), Kendall τ and Spearman ρ against coreset size (%); (b) against number of source models.]
Figure 17: Ablations on IRT dimensionality p ∈ {1, 5, 10} for gp-IRT variants on the continuous experiments in § 3.3.
Figure 18: Ablations on IRT dimensionality p ∈ {1, 5, 10} for gp-IRT variants on the pass@k experiments in § 3.3.
Figure 19: Results from § 3.2. (a) M = 32. (b) 10% coresets.
Figure 20: Results from § 3.1. (a) M = 15. (b) 10% coresets.
I Pearson's correlation coefficient results: We do not report Pearson correlation coefficients between true and predicted scores of models in the main text because it is more informative to report ranking correlation coefficients τ and ρ which are not affected by the scale of scores. However, in …
Figure 21: Pearson correlation coefficients on binary experiments in § 3.1.
Figure 22: Pearson correlation coefficients on continuous non-pass@…
Figure 23: Pearson correlation coefficients on pass@…
Figure 24: (a) Φ̂ stability on methods in binary experiments, as shown in …
Figure 25: Question difficulty distribution histograms on various datasets for coresets generated using …
Figure 26: True versus predicted scores on ARC-Challenge, with each row representing a different …
Figure 27: True versus predicted scores on more binary datasets (the same as in Fig. 25) across five …
Original abstract

Efficient benchmarking techniques aim to lower the computational cost of evaluating LLMs by predicting full benchmark scores using only a subset of a benchmark's questions. By reframing this problem as an instance of multiple regression with feature selection, we find that existing efficient benchmarking methods can be greatly improved by simply using kernel ridge regression at the prediction stage. Additionally, using an information-theoretic feature-selection algorithm called minimum redundancy maximum relevance (mRMR), we can further improve upon these methods by selecting question subsets that will be maximally useful for prediction. Except in very data-poor settings, these approaches consistently achieve smaller prediction errors (in both MAE and RMSE), and greater ranking correlation between predicted and true scores (in both Spearman $\rho$ and Kendall $\tau$) across a range of benchmarks using both binary and continuous metrics. Furthermore, mRMR subsampling is much faster than competitor methods (which often involve fitting probabilistic models or running clustering algorithms), and is more likely to select the same questions under different random seeds or training data splits. Tutorial code can be found at https://github.com/sambowyer/mrmr_eval .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reframes efficient LLM benchmarking as a multiple regression task with feature selection. It claims that replacing existing predictors with kernel ridge regression (KRR) and using minimum redundancy maximum relevance (mRMR) to select question subsets yields lower MAE and RMSE, higher Spearman ρ and Kendall τ, faster computation, and greater stability across seeds and splits than prior methods, except in very data-poor regimes. Tutorial code is provided.

Significance. If the empirical claims hold under proper controls, the work supplies a simple, computationally lightweight alternative to clustering or probabilistic-model-based subsampling for efficient benchmarking. The emphasis on reproducibility via public code is a strength.

major comments (2)
  1. [Experimental evaluation (implicit in abstract claims)] The central claim that mRMR-selected subsets plus KRR produce unbiased predictions for models outside the training distribution is load-bearing. The abstract and methods description give no indication that experiments were stratified by model family, architecture, or training corpus; mRMR relevance scores are computed from full-benchmark scores of the training models, so any architecture-correlated question difficulty can be absorbed by the kernel and produce systematic over- or under-prediction on unseen families. Without such a split, the reported improvements cannot be taken as general.
  2. [Results and discussion] The statement that improvements hold 'except in very data-poor settings' is not accompanied by a concrete definition of that regime or by ablation tables showing the transition point. This boundary is central to the practical recommendation yet remains unquantified.
minor comments (2)
  1. [Abstract] The abstract asserts consistent gains in four metrics but supplies no dataset sizes, number of models, error bars, or baseline descriptions; these details must appear in the main text with explicit controls for post-hoc model selection.
  2. [Methods] Notation for the regression target (full-benchmark score) and the feature matrix (question-level scores) should be introduced once and used consistently; the current description leaves the precise mapping from binary/continuous metrics to the regression problem implicit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments identify important gaps in experimental controls and result quantification. We respond to each below and commit to revisions that directly address the concerns.

  1. Referee: [Experimental evaluation (implicit in abstract claims)] The central claim that mRMR-selected subsets plus KRR produce unbiased predictions for models outside the training distribution is load-bearing. The abstract and methods description give no indication that experiments were stratified by model family, architecture, or training corpus; mRMR relevance scores are computed from full-benchmark scores of the training models, so any architecture-correlated question difficulty can be absorbed by the kernel and produce systematic over- or under-prediction on unseen families. Without such a split, the reported improvements cannot be taken as general.

    Authors: We agree that the absence of explicit family- or architecture-stratified hold-outs limits the strength of claims about generalization to entirely unseen model distributions. Our model pool is diverse, but relevance scores and kernels were fit without such separation. In revision we will add a new set of experiments that hold out entire model families (e.g., all Llama variants, all Mistral variants) during both mRMR selection and KRR training, then report MAE, RMSE, Spearman ρ and Kendall τ on the held-out families. We will also discuss the scope of the current results as applying within the observed model distribution. revision: yes

  2. Referee: [Results and discussion] The statement that improvements hold 'except in very data-poor settings' is not accompanied by a concrete definition of that regime or by ablation tables showing the transition point. This boundary is central to the practical recommendation yet remains unquantified.

    Authors: We accept that the qualifier 'very data-poor settings' is imprecise and unsupported by quantitative thresholds or ablations. In the revised manuscript we will (i) define the regime explicitly (training sets of fewer than 30 models for the largest benchmarks, scaled proportionally for smaller ones), (ii) add ablation tables that vary the number of training models from 10 to the full set while holding the question subset fixed, and (iii) mark the point at which the proposed KRR+mRMR method ceases to outperform the baselines in each metric. revision: yes
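The two revisions promised in these responses, family-level hold-outs and a sweep over the number of training models with the question subset held fixed, can be sketched as follows. This is an illustrative outline on synthetic data, not the authors' protocol. The family labels, the fixed coreset (simply the first 20 questions), the RBF kernel settings and all function names are assumptions, and the synthetic data contain no real family effects.

```python
import numpy as np

def leave_one_family_out(families):
    """Yield (family, train_idx, test_idx) with one whole family held out."""
    families = np.asarray(families)
    for fam in sorted(set(families.tolist())):
        yield fam, np.flatnonzero(families != fam), np.flatnonzero(families == fam)

def krr_predict(X_tr, y_tr, X_te, lam=1e-2, gamma=0.05):
    """RBF kernel ridge regression in closed form (target centred first)."""
    def k(A, B):
        sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * np.maximum(sq, 0.0))
    mu = y_tr.mean()
    alpha = np.linalg.solve(k(X_tr, X_tr) + lam * np.eye(len(X_tr)), y_tr - mu)
    return k(X_te, X_tr) @ alpha + mu

def training_size_sweep(X, y, train_idx, test_idx, sizes, seed=0):
    """MAE on test_idx when only m randomly chosen training models are used."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(train_idx)
    results = {}
    for m in sizes:
        sub = order[:m]                      # nested subsets of training models
        pred = krr_predict(X[sub], y[sub], X[test_idx])
        results[m] = float(np.abs(pred - y[test_idx]).mean())
    return results

# Synthetic stand-in data: 50 models, 120 binary questions.
rng = np.random.default_rng(1)
ability = rng.normal(size=50)
difficulty = rng.normal(size=120)
Y = (rng.random((50, 120)) < 1 / (1 + np.exp(difficulty[None, :] - ability[:, None]))).astype(float)
scores = Y.mean(axis=1)
X = Y[:, :20]                                # hypothetical fixed coreset
families = np.array(["llama"] * 15 + ["mistral"] * 15 + ["other"] * 20)

for fam, tr, te in leave_one_family_out(families):
    print(fam, training_size_sweep(X, scores, tr, te, sizes=[10, 20, len(tr)]))
```

In a real run, question selection would also have to be repeated inside each split on the training families only, so that the held-out family influences neither the coreset nor the regression fit.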

Circularity Check

0 steps flagged

Standard regression pipeline with independent empirical validation


The paper reframes efficient benchmarking as multiple regression plus feature selection and reports that KRR + mRMR yields lower MAE/RMSE and higher Spearman/Kendall correlations than prior methods across benchmarks. All performance numbers are obtained from explicit train/test splits on held-out model scores; the reported improvements are therefore measured quantities, not quantities forced by the fitting procedure itself. No equations equate a claimed prediction to its own training targets by construction, no uniqueness theorem is imported from self-citation, and the mRMR step is a standard information-theoretic algorithm whose output is not presupposed by the evaluation metrics. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms or invented entities.



Reference graph

Works this paper leans on

81 extracted references · 28 canonical work pages

  1. [1]

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag : Can a Machine Really Finish Your Sentence ? In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , pages 4791--4800, Florence, Italy, July 2019. Association for Computational Ling...

  2. [2]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Language Understanding . In International Conference on Learning Representations , 2021 a . URL https://openreview.net/forum?id=d7KBjmI3GmQ

  3. [3]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  4. [4]

    Manning, Christopher Re, Diana Acosta-Navas, Drew A

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Re, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu...

  5. [5]

    Open LLM Leaderboard v2, 2024

    Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. Open LLM Leaderboard v2, 2024. URL https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard

  6. [6]

    Anchor points: Benchmarking models with much fewer examples

    Rajan Vivek, Kawin Ethayarajh, Diyi Yang, and Douwe Kiela. Anchor points: Benchmarking models with much fewer examples. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics ( Volume 1: Long Papers ) , pages 1576--1601, 2024. URL https://aclanthology.org/anthology-files/pdf/eacl/2024.eacl-long.95.pdf

  7. [7]

    tinyBenchmarks : evaluating LLMs with fewer examples

    Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinyBenchmarks : evaluating LLMs with fewer examples. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning , vol...

  8. [8]

    Schulze Buschoff, and Eric Schulz

    Alex Kipnis, Konstantinos Voudouris, Luca M. Schulze Buschoff, and Eric Schulz. metabench - A Sparse Benchmark of Reasoning and Knowledge in Large Language Models . In The Thirteenth International Conference on Learning Representations , 2025. URL https://openreview.net/forum?id=4T33izzFpK

  9. [9]

    Confident Rankings with Fewer Items : Adaptive LLM Evaluation with Continuous Scores , 2026

    Esma Balkır, Alice Pernthaller, Marco Basaldella, José Hernández-Orallo, and Nigel Collier. Confident Rankings with Fewer Items : Adaptive LLM Evaluation with Continuous Scores , 2026. URL https://arxiv.org/pdf/2601.13885

  10. [10]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE -bench: Can Language Models Resolve Real -world Github Issues ? In The Twelfth International Conference on Learning Representations , 2024. URL https://openreview.net/forum?id=VTF8yNQM66

  11. [11]

    TurnBench - MS : A Benchmark for Evaluating Multi - Turn , Multi - Step Reasoning in Large Language Models

    Yiran Zhang, Mo Wang, Xiaoyang Li, Kaixuan Ren, Chencheng Zhu, and Usman Naseem. TurnBench - MS : A Benchmark for Evaluating Multi - Turn , Multi - Step Reasoning in Large Language Models . In Findings of the Association for Computational Linguistics : EMNLP 2025 , pages 19892--19924, 2025 a . doi:10.18653/v1/2025.findings-emnlp.1084. URL http://arxiv.org...

  12. [12]

    Chain-of- Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of- Thought Prompting Elicits Reasoning in Large Language Models . In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems , volume 35, pages 24824--24837. Curran A...

  13. [13]

    The Curious Case of Neural Text Degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The Curious Case of Neural Text Degeneration . In International Conference on Learning Representations , 2020. URL https://openreview.net/forum?id=rygGQyrFvH

  14. [14]

    The Effect of Sampling Temperature on Problem Solving in Large Language Models

    Matthew Renze. The Effect of Sampling Temperature on Problem Solving in Large Language Models . In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics : EMNLP 2024 , pages 7346--7356, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.finding...

  15. [15]

    Roush, Andreas Kirsch, and Ravid Shwartz-Ziv

    Nguyen Nhat Minh, Andrew Baker, Clement Neo, Allen G. Roush, Andreas Kirsch, and Ravid Shwartz-Ziv. Turning Up the Heat : Min -p Sampling for Creative and Coherent LLM Outputs . In The Thirteenth International Conference on Learning Representations , 2025. URL https://openreview.net/forum?id=FBkpCyujtS

  16. [16]

    Quantifying Language Models ' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

    Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying Language Models ' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations , 2024. URL https://openreview.net/forum?id=RIu5lyNXjT

  17. [17]

    Bell, Anaelia Ovalle, Jesse Dodge, Antoine Bosselut, Koustuv Sinha, and Adina Williams

    Angelika Romanou, Mark Ibrahim, Candace Ross, Chantal Shaib, Kerem Oktar, Samuel J. Bell, Anaelia Ovalle, Jesse Dodge, Antoine Bosselut, Koustuv Sinha, and Adina Williams. Brittlebench: Quantifying LLM robustness via prompt sensitivity, April 2026. URL http://arxiv.org/abs/2603.13285. arXiv:2603.13285 [cs]

  18. [18]

    Language Models are Few - Shot Learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

  19. [19]

    Analysis and comparison of feature selection methods towards performance and stability

    Matheus Cezimbra Barbieri, Bruno Iochins Grisci, and Márcio Dorn. Analysis and comparison of feature selection methods towards performance and stability. Expert Systems with Applications, 249: 0 123667, 2024. ISSN 0957-4174. doi:https://doi.org/10.1016/j.eswa.2024.123667. URL https://www.sciencedirect.com/science/article/pii/S0957417424005335

  20. [20]

    Dipti Theng and Kishor K. Bhoyar. Feature selection techniques for machine learning: a survey of more than two decades of research. Knowledge and Information Systems, 66 0 (3): 0 1575--1637, March 2024. ISSN 0219-3116. doi:10.1007/s10115-023-02010-5. URL https://doi.org/10.1007/s10115-023-02010-5

  21. [21]

    Hanchuan Peng, Fuhui Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27 0 (8): 0 1226--1238, 2005. doi:10.1109/TPAMI.2005.159

  22. [22]

    Minimum Redundancy Feature Selection From Microarray Gene Expression Data

    Chris Ding and Hanchuan Peng. Minimum Redundancy Feature Selection From Microarray Gene Expression Data . Journal of Bioinformatics and Computational Biology, 03 0 (02): 0 185--205, 2005. doi:10.1142/S0219720005001004. URL https://doi.org/10.1142/S0219720005001004

  23. [23]

    Maximum Relevance and Minimum Redundancy Feature Selection Methods for a Marketing Machine Learning Platform

    Zhenyu Zhao, Radhika Anand, and Mallory Wang. Maximum Relevance and Minimum Redundancy Feature Selection Methods for a Marketing Machine Learning Platform . In 2019 IEEE International Conference on Data Science and Advanced Analytics ( DSAA ) , pages 442--452, 2019. doi:10.1109/DSAA.2019.00059

  24. [24]

    Hoerl and Robert W

    Arthur E. Hoerl and Robert W. Kennard. Ridge Regression : Biased Estimation for Nonorthogonal Problems . Technometrics, 12 0 (1): 0 55--67, 1970. doi:10.1080/00401706.1970.10488634. URL https://doi.org/10.1080/00401706.1970.10488634

  25. [25]

    Ridge Regression Learning Algorithm in Dual Variables

    Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge Regression Learning Algorithm in Dual Variables . In Proceedings of the Fifteenth International Conference on Machine Learning , ICML '98, pages 515--521, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc. ISBN 1558605568

  26. [26]

    The Elements of Statistical Learning

    Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning . Springer Series in Statistics . Springer New York Inc., New York, NY, USA, 2001

  27. [27]

    Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip - Bigram Statistics

    Chin-Yew Lin and Franz Josef Och. Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip - Bigram Statistics . In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics ( ACL -04) , pages 605--612, Barcelona, Spain, July 2004. doi:10.3115/1218955.1219032. URL https://aclanthology.org/...

  28. [28]

    Moser, Arundhati S

    Brian B. Moser, Arundhati S. Shanbhag, Stanislav Frolov, Federico Raue, Joachim Folz, and Andreas Dengel. A Coreset Selection of Coreset Selection Literature : Introduction and Recent Advances , January 2026. URL http://arxiv.org/abs/2505.17799. arXiv:2505.17799 [cs]

  29. [29]

    NP -completeness of searches for smallest possible feature sets

    Scott Davies and Stuart Russell. NP -completeness of searches for smallest possible feature sets. In AAAI Symposium on Intelligent Relevance , pages 37--39. AAAI Press Menlo Park, 1994

  30. [30]

    Brian C. Ross. Mutual Information between Discrete and Continuous Data Sets . PLOS ONE, 9 0 (2): 0 1--5, February 2014. doi:10.1371/journal.pone.0087357. URL https://doi.org/10.1371/journal.pone.0087357

  31. [31]

    Estimating mutual information

    Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Phys. Rev. E, 69 0 (6): 0 066138, June 2004. doi:10.1103/PhysRevE.69.066138. URL https://link.aps.org/doi/10.1103/PhysRevE.69.066138

  32. [32]

    Efficient Estimation of Mutual Information for Strongly Dependent Variables

    Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. Efficient Estimation of Mutual Information for Strongly Dependent Variables . In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics , volume 38 of Proceedings of Machine Learning Research , pages 277--286, San Diego...

  33. [33]

    Frank and Jerome H

    Ildiko E. Frank and Jerome H. Friedman. A Statistical View of Some Chemometrics Regression Tools . Technometrics, 35 0 (2): 0 109--135, 1993. ISSN 00401706. URL http://www.jstor.org/stable/1269656

  34. [34]

    Ridge Regularization : An Essential Concept in Data Science

    Trevor Hastie. Ridge Regularization : An Essential Concept in Data Science . Technometrics, 62 0 (4): 0 426--433, October 2020. ISSN 0040-1706, 1537-2723. doi:10.1080/00401706.2020.1791959. URL https://www.tandfonline.com/doi/full/10.1080/00401706.2020.1791959

  35. [35]

    Boser, Isabelle M

    Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory , COLT '92, pages 144--152, New York, NY, USA, 1992. Association for Computing Machinery. ISBN 089791497X. doi:10.1145/130385.130401. URL https://doi.org/10.1145/130...

  36. [36]

    Regression Shrinkage and Selection Via the Lasso

    Robert Tibshirani. Regression Shrinkage and Selection Via the Lasso . Journal of the Royal Statistical Society: Series B (Methodological), 58 0 (1): 0 267--288, January 1996. ISSN 0035-9246. doi:10.1111/j.2517-6161.1996.tb02080.x. URL https://doi.org/10.1111/j.2517-6161.1996.tb02080.x

  37. [37]

    Lord, M.R

    F.M. Lord, M.R. Novick, and Allan Birnbaum. Statistical theories of mental test scores. Statistical theories of mental test scores. Addison-Wesley, Oxford, England, 1968

  38. [38]

    Item response theory: Parameter estimation techniques

    Frank B Baker and Seock-Ho Kim. Item response theory: Parameter estimation techniques . CRC press, 2004. URL https://www.ime.unicamp.br/ cnaber/Baker_Book.pdf

  39. [39]

    Van Der Linden

    Wim J. Van Der Linden. Handbook of Item Response Theory , Three Volume Set . Chapman and Hall/CRC, Boca Raton, FL : CRC Press, 2015-, 1 edition, February 2018. ISBN 9781315119144. doi:10.1201/9781315119144. URL https://www.taylorfrancis.com/books/9781315119144

  40. [40]

    Building an Evaluation Scale using Item Response Theory

    John P Lalor, Hao Wu, and Hong Yu. Building an Evaluation Scale using Item Response Theory . Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, 2016: 0 648--657, November 2016. doi:10.18653/v1/d16-1062. URL https://europepmc.org/articles/PMC5167538

  41. [41]

    Clustering Examples in Multi - Dataset Benchmarks with Item Response Theory

    Pedro Rodriguez, Phu Mon Htut, John Lalor, and João Sedoc. Clustering Examples in Multi - Dataset Benchmarks with Item Response Theory . In Shabnam Tafreshi, João Sedoc, Anna Rogers, Aleksandr Drozd, Anna Rumshisky, and Arjun Akula, editors, Proceedings of the Third Workshop on Insights from Negative Results in NLP , pages 100--112, Dublin, Ireland, May 2...

  42. [42]

    What Does Your Benchmark Really Measure ? A Framework for Robust Inference of AI Capabilities , 2025

    Nathanael Jo and Ashia Wilson. What Does Your Benchmark Really Measure ? A Framework for Robust Inference of AI Capabilities , 2025. URL https://arxiv.org/pdf/2509.19590

  43. [43]

    Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pontus Stenetorp, Sharan Narang, and Dieuwke Hupkes

    Lovish Madaan, Aaditya K. Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pontus Stenetorp, Sharan Narang, and Dieuwke Hupkes. Quantifying Variance in Evaluation Benchmarks . In NeurIPS 2024 Workshop on Regulatable ML , 2025. URL https://openreview.net/forum?id=M9dCa4vYgp

  44. [44]

    A Beta Item Response Model for Continuous Bounded Responses

    Yvonnick Noel and Bruno Dauvier. A Beta Item Response Model for Continuous Bounded Responses. Applied Psychological Measurement, 31(1): 47--73, 2007. doi:10.1177/0146621605287691. URL https://doi.org/10.1177/0146621605287691

  45. [45]

    How Benchmark Prediction from Fewer Data Misses the Mark

    Guanhua Zhang, Florian E. Dorner, and Moritz Hardt. How Benchmark Prediction from Fewer Data Misses the Mark. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025b. URL https://openreview.net/forum?id=o3bftqj17e

  46. [46]

    Efficient Attentions for Long Document Summarization

    Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. Efficient Attentions for Long Document Summarization. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of ...

  47. [47]

    Overview of the BioLaySumm 2025 Shared Task on Lay Summarization of Biomedical Research Articles and Radiology Reports

    Chenghao Xiao, Kun Zhao, Xiao Wang, Siwei Wu, Sixing Yan, Tomas Goldsack, Sophia Ananiadou, Noura Al Moubayed, Liang Zhan, William Cheung, and Chenghua Lin. Overview of the BioLaySumm 2025 Shared Task on Lay Summarization of Biomedical Research Articles and Radiology Reports. In The 24th Workshop on Biomedical Natural Language Processing and BioNLP Share...

  48. [48]

    BERTScore: Evaluating Text Generation with BERT

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SkeHuCVFDr

  49. [49]

    J. P. Kincaid, Robert P. Fishburne Jr., Richard L. Rogers, and Brad S. Chissom. Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Technical report, Defense Technical Information Center, Fort Belvoir, VA, February 1975. URL https://apps.dtic.mil/sti/citations/tr...

  50. [50]

    Program Synthesis with Large Language Models, August 2021

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program Synthesis with Large Language Models, August 2021. URL http://arxiv.org/abs/2108.07732. arXiv:2108.07732 [cs]

  51. [51]

    Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation, 2023

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation, 2023. URL https://arxiv.org/abs/2305.01210

  52. [52]

    On Leakage of Code Generation Evaluation Datasets

    Alexandre Matton, Tom Sherborne, Dennis Aumiller, Elena Tommasone, Milad Alizadeh, Jingyi He, Raymond Ma, Maxime Voisin, Ellen Gilsenan-McMahon, and Matthias Gallé. On Leakage of Code Generation Evaluation Datasets. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, page...

  53. [53]

    The Proof and Measurement of Association between Two Things

    C. Spearman. The Proof and Measurement of Association between Two Things. The American Journal of Psychology, 15(1): 72--101, 1904. ISSN 00029556. URL http://www.jstor.org/stable/1412159

  54. [54]

    M. G. Kendall. A New Measure of Rank Correlation. Biometrika, 30(1-2): 81--93, June 1938. ISSN 0006-3444. doi:10.1093/biomet/30.1-2.81. URL https://doi.org/10.1093/biomet/30.1-2.81

  55. [55]

    On the stability of feature selection algorithms

    Sarah Nogueira, Konstantinos Sechidis, and Gavin Brown. On the stability of feature selection algorithms. Journal of Machine Learning Research, 18(174): 1--54, 2018

  56. [56]

    Ppi++: Efficient prediction-powered inference, 2023

    Anastasios N Angelopoulos, John C Duchi, and Tijana Zrnic. Ppi++: Efficient prediction-powered inference, 2023. URL https://arxiv.org/pdf/2311.01453

  57. [57]

    Active Evaluation Acquisition for Efficient LLM Benchmarking

    Yang Li, Jie Ma, Miguel Ballesteros, Yassine Benajiba, and Graham Horwood. Active Evaluation Acquisition for Efficient LLM Benchmarking. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267 o...

  58. [58]

    Valentin Hofmann, David Heineman, Ian Magnusson, Kyle Lo, Jesse Dodge, Maarten Sap, Pang Wei Koh, Chun Wang, Hannaneh Hajishirzi, and Noah A. Smith. Fluid Language Model Benchmarking. In Second Conference on Language Modeling, 2025

  59. [59]

    Beyond One-Size-Fits-All: Tailored Benchmarks for Efficient Evaluation

    Peiwen Yuan, Yueqi Zhang, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, and Kan Li. Beyond One-Size-Fits-All: Tailored Benchmarks for Efficient Evaluation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15591--15615, 2025. URL https://...

  60. [60]

    Disco: Diversifying sample condensation for efficient model evaluation, 2025

    Alexander Rubinstein, Benjamin Raible, Martin Gubri, and Seong Joon Oh. Disco: Diversifying sample condensation for efficient model evaluation, 2025. URL https://arxiv.org/pdf/2510.07959

  61. [61]

    SubLIME: Subset Selection via Rank Correlation Prediction for Data-Efficient LLM Evaluation

    Gayathri Saranathan, Cong Xu, Mahammad Parwez Alam, Tarun Kumar, Martin Foltin, Soon Yee Wong, and Suparna Bhattacharya. SubLIME: Subset Selection via Rank Correlation Prediction for Data-Efficient LLM Evaluation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Asso...

  62. [62]

    Effieval: Efficient and generalizable model evaluation via capability coverage maximization, 2025

    Yaoning Wang, Jiahao Ying, Yixin Cao, Yubo Ma, and Yugang Jiang. Effieval: Efficient and generalizable model evaluation via capability coverage maximization, 2025. URL https://arxiv.org/pdf/2508.09662

  63. [63]

    BenTo: Benchmark Task Reduction with In-Context Transferability

    Hongyu Zhao, Ming Li, Lichao Sun, and Tianyi Zhou. BenTo: Benchmark Task Reduction with In-Context Transferability. In International Conference on Learning Representations (ICLR), 2025. URL https://openreview.net/forum?id=4798eef078

  64. [64]

    You Don't Need to Run Every Eval, 2026

    Dimitris Papailiopoulos. You Don't Need to Run Every Eval, 2026. URL https://github.com/anadim/llm-benchmark-matrix

  65. [65]

    100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances

    Lorenzo Pacchiardi, Lucy G. Cheke, and José Hernández-Orallo. 100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances. 2024 KDD workshop on Evaluation and Trustworthiness of Generative AI Models, 2024. URL https://arxiv.org/abs/2409.03563

  66. [66]

    How to Select Datapoints for Efficient Human Evaluation of NLG Models?

    Vilém Zouhar, Peng Cui, and Mrinmaya Sachan. How to Select Datapoints for Efficient Human Evaluation of NLG Models? Transactions of the Association for Computational Linguistics, 13: 1789--1811, 2025. doi:10.1162/tacl.a.60. URL https://aclanthology.org/2025.tacl-1.80/

  67. [67]

    Note on Regression and Inheritance in the Case of Two Parents

    Karl Pearson. Note on Regression and Inheritance in the Case of Two Parents. Proceedings of the Royal Society of London Series I, 58: 240--242, January 1895

  68. [68]

    Solutions to instability problems with sequential wrapper-based approaches to feature selection

    Kevin Dunne, Padraig Cunningham, and Francisco Azuaje. Solutions to instability problems with sequential wrapper-based approaches to feature selection. Journal of Machine Learning Research, 1: 22, 2002

  69. [69]

    Instruction-Following Evaluation for Large Language Models, November 2023

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-Following Evaluation for Large Language Models, November 2023. URL http://arxiv.org/abs/2311.07911. arXiv:2311.07911 [cs]

  70. [70]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring Mathematical Problem Solving With the MATH Dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021b. URL https://openreview.net/forum?id=7Bywt2mQsCe

  71. [71]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paq...

  72. [72]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, March 2018

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, March 2018. URL http://arxiv.org/abs/1803.05457. arXiv:1803.05457 [cs]

  73. [73]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistic...

  74. [74]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98

  75. [75]

    MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning

    Zayne Rea Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=jenyYQzue1

  76. [76]

    CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ...

  77. [77]

    Training Verifiers to Solve Math Word Problems, November 2021

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training Verifiers to Solve Math Word Problems, November 2021. URL http://arxiv.org/abs/2110.14168. arXiv:2110.14168 [cs]

  78. [78]

    LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

    Neel Guha, Julian Nyarko, Daniel Ho, Christopher Ré, Adam Chilton, Aditya K, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John Nay, Jonathan Choi, Kevin Tobia, Mar...

  79. [79]

    What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences, 11(14), 2021. ISSN 2076-3417. doi:10.3390/app11146421. URL https://www.mdpi.com/2076-3417/11/14/6421

  80. [80]

    Toward a unified framework for data-efficient evaluation of large language models, 2025

    Lele Liao, Qile Zhang, Ruofan Wu, and Guanhua Fang. Toward a unified framework for data-efficient evaluation of large language models, 2025. URL https://arxiv.org/pdf/2510.04051
