Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs

Ana Gjorgjevikj; Denica Kjorvezir; Gjorgjina Cenikj; Marko Djukanovi\'c; Tome Eftimov

arxiv: 2606.01400 · v1 · pith:T2T5TCH4new · submitted 2026-05-31 · 💻 cs.CL · cs.AI

Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs

Denica Kjorvezir , Marko Djukanovi\'c , Ana Gjorgjevikj , Gjorgjina Cenikj , Tome Eftimov This is my paper

Pith reviewed 2026-06-28 16:54 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords LLM benchmarksprompt selectionmaximum independent setsimilarity graphsevaluation efficiencyranking consistency

0 comments

The pith

Maximum independent set selection on prompt similarity graphs produces consistent LLM benchmark rankings with fewer prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes modeling LLM benchmarks as graphs where prompts are nodes and edges connect those with high embedding similarity, then selecting a maximum independent set to remove redundant prompts. This approach is tested across multiple benchmarks, embedding models, and solvers to check if the reduced sets produce stable model rankings. The results show high consistency in rankings across different selections and only occasional divergence from the full benchmark. This matters because evaluating LLMs on full benchmarks is costly, so efficient subsets could save resources while maintaining reliability.

Core claim

Repeated application of maximum independent set algorithms to similarity graphs of benchmark prompts yields subsets that produce LLM rankings with Kendall's W at least 0.90 in 99.2% of cases, achieving 25-48% prompt reduction at higher thresholds, with ranking divergence from the full benchmark occurring in only 15.95% of configurations.

What carries the argument

Maximum Independent Set (MIS) algorithms applied to similarity graphs built from embedding-space distances between prompts, with edges for distances above a percentile threshold.

If this is right

LLM rankings remain stable across different random seeds in stochastic MIS solvers.
Subsets can reduce the number of prompts needed by 25 to 48 percent on average at higher thresholds.
Most configurations maintain high correlation (rho >= 0.95) with full benchmark rankings.
Failure modes are concentrated in dense graphs at low thresholds on certain benchmarks like GPQA and IFEval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar graph-based reduction could be applied to other evaluation tasks beyond LLMs if embedding similarity correlates with performance correlation.
Choosing the right percentile threshold is key to balancing reduction and fidelity, suggesting adaptive methods might improve results.
Future work could explore dynamic thresholds or hybrid selection methods to minimize the 15.95% divergence cases.

Load-bearing premise

Embedding-space distance above a percentile threshold indicates that two prompts will produce correlated model behaviors across LLMs, making them functionally redundant.

What would settle it

An experiment showing that for a selected subset, the ranking of LLMs differs substantially from the full benchmark in a way not explained by the reported 15.95% divergence rate, or that Kendall's W falls below 0.90 in many configurations.

Figures

Figures reproduced from arXiv: 2606.01400 by Ana Gjorgjevikj, Denica Kjorvezir, Gjorgjina Cenikj, Marko Djukanovi\'c, Tome Eftimov.

**Figure 2.** Figure 2: Embedding space coverage illustration. (a) Full benchmark: Domain A is overrepresented, [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Kendall’s𝑊 (seed concordance) vs. Spearman 𝜌 (agreement with full baseline) for all stochastic configurations, coloured by embedding model. Reference lines at 𝑊 = 0.90, 𝜌 = 0.95 define four quadrants. Most points fall in the upper region (𝑊 ≥ 0.99 in 99.2% of configs), with substantial spread along the 𝑥-axis (varying 𝜌), confirming the hypothesis. (BGE, E5, GIST; 75.7–80.4%; [PITH_FULL_IMAGE:figures/ful… view at source ↗

**Figure 4.** Figure 4: Agreement with full baseline (𝜌, left) and seed consistency (𝑊 , right) as a function of the fraction of prompts selected, by dataset. Trend lines fitted per benchmark and stochastic optimization algorithm. MMLU-Pro and Omni-MATH achieve high quality even with small fractions (p10, p20, etc); IFEval requires larger subsets for comparable quality. distinction, where the MIS selection focuses on a restricted… view at source ↗

**Figure 5.** Figure 5: Distance distributions for dataset GPQA_CHUNKED and different models with selected met [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: Distance distributions for dataset GPQA_NOCHUNK and different models with selected [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 7.** Figure 7: Distance distributions for dataset GPQA and different (small) models with selected metrics [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: Rank shift heatmap for Online-MIS on IFEval (Qwen4b, cosine, p20). Each cell: deviation of [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

read the original abstract

Evaluating large language models (LLMs) across comprehensive benchmarks is expensive and time-consuming. We propose a graph-based prompt selection framework that models each benchmark as a similarity graph -- nodes are prompts connected if their embedding-space distance falls above a configurable threshold -- and applies Maximum Independent Set (MIS) algorithms to select a maximally diverse, non-redundant subset. We evaluate four MIS solvers (CPLEX, GREEDY, Online-MIS, ReduMIS) across six embedding models, three distance measures, six percentile thresholds, and four benchmarks (GPQA, IFEval, MMLU-Pro, Omni-MATH) covering 66 LLMs. Our central hypothesis -- that repeated selection under different random seeds yields consistent LLM rankings that may also differ from the full-benchmark baseline -- is strongly confirmed: Kendall's $W \geq 0.90$ in 99.2\% of stochastic configurations (mean $W = 0.997 \pm 0.008$), while at higher percentile thresholds selected subsets achieve 25--48\% prompt reduction on average. Ranking divergence from the full benchmark ($\rho < 0.95$) occurs in only 15.95\% of configurations, concentrated at low thresholds ($p_{10}$--$p_{20}$) and benchmarks (GPQA, IFEval), identifying overly dense graphs as the primary failure mode.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MIS on prompt embedding graphs gives stable reduced benchmarks with high ranking consistency, but skips testing whether similarity actually marks functional redundancy.

read the letter

The main takeaway is that maximum independent set selection on graphs built from prompt embeddings produces LLM rankings that stay highly consistent across random seeds, with Kendall's W averaging 0.997 in 99.2 percent of the tested configurations, while cutting 25 to 48 percent of prompts at the stricter thresholds.

What the paper actually does is take standard MIS solvers and run them on similarity graphs derived from six different embeddings and three distance measures. They cover four benchmarks, six thresholds, and 66 models total. The scale of that sweep is useful, and they report the failure cases plainly: dense graphs at low thresholds on GPQA and IFEval are where divergence from the full benchmark shows up.

The soft spot is the untested premise that embedding proximity equals redundant evaluation signal. The work measures how stable the selected subsets are under reseeding and how often they preserve overall rankings, but it does not check whether models' performance vectors actually correlate on prompts that the graph treats as similar. Without that link, the consistency numbers could reflect a repeatable selection process more than preserved benchmark information.

This is aimed at groups that run repeated LLM evaluations and need cheaper subsets without changing model orderings much. Readers working on benchmark pruning or cost reduction will find the concrete numbers and solver comparisons worth trying. It has enough empirical grounding and clear reporting to merit peer review, even though the proxy assumption will need more direct checks.

Referee Report

2 major / 2 minor

Summary. The paper proposes modeling LLM benchmarks as similarity graphs where prompts are nodes connected if their embedding distances exceed a percentile threshold, then using MIS solvers (CPLEX, GREEDY, Online-MIS, ReduMIS) to extract diverse subsets. Across six embedding models, three distance measures, six thresholds, and four benchmarks (GPQA, IFEval, MMLU-Pro, Omni-MATH) with 66 LLMs, it reports that stochastic MIS selections produce highly consistent LLM rankings (Kendall's W ≥ 0.90 in 99.2% of configurations, mean W = 0.997 ± 0.008) while achieving 25–48% prompt reduction at higher thresholds, with ranking divergence from the full benchmark (ρ < 0.95) occurring in only 15.95% of cases, mainly at low thresholds.

Significance. If the embedding-distance proxy for functional redundancy holds, the method offers a practical route to cheaper yet reliable LLM evaluation with explicit failure modes identified. The work earns credit for its large-scale, multi-factor experimental design (four solvers × six embeddings × three distances × six thresholds × four benchmarks) and direct reporting of both consistency metrics and divergence rates, which supports reproducibility of the core empirical claims.

major comments (2)

[Similarity graph construction and hypothesis statement (abstract and §3)] The central hypothesis and all reported consistency/reduction results rest on the unvalidated premise that embedding-space distance above a percentile threshold indicates functional redundancy (i.e., correlated LLM behavior across the 66 models). No analysis is provided that correlates embedding distances with pairwise Spearman correlations of per-LLM accuracy vectors, leaving open the possibility that high Kendall's W reflects stability of an arbitrary selection process rather than preservation of evaluation signal.
[Results on ranking divergence] Table/figure reporting ranking divergence (15.95% of configurations with ρ < 0.95) measures agreement between subset and full-benchmark rankings but does not test whether the selected prompts actually preserve the relative ordering that would arise from non-redundant prompts; this makes the claim that divergence is "concentrated at low thresholds" dependent on the same untested proxy.

minor comments (2)

[Experimental protocol] The exact procedure for handling ties or disconnected components in the MIS solvers should be stated explicitly, as it affects reproducibility of the 99.2% consistency figure.
[Methodology] Clarify whether the six percentile thresholds are computed per-benchmark or globally, and report the resulting graph densities for each combination.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify a key area for strengthening the validation of our central hypothesis. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses

Referee: [Similarity graph construction and hypothesis statement (abstract and §3)] The central hypothesis and all reported consistency/reduction results rest on the unvalidated premise that embedding-space distance above a percentile threshold indicates functional redundancy (i.e., correlated LLM behavior across the 66 models). No analysis is provided that correlates embedding distances with pairwise Spearman correlations of per-LLM accuracy vectors, leaving open the possibility that high Kendall's W reflects stability of an arbitrary selection process rather than preservation of evaluation signal.

Authors: We agree that the manuscript would be strengthened by an explicit validation of the embedding-distance proxy. In the revision we will add a new analysis that computes, for each benchmark, the Spearman correlation between pairwise embedding distances and the corresponding pairwise Spearman correlations of the per-LLM accuracy vectors across the 66 models. We will report these correlations (stratified by embedding model, distance measure, and threshold) and discuss their relationship to the observed Kendall's W values, thereby testing whether the graph construction captures functional redundancy rather than merely producing stable but arbitrary subsets. revision: yes
Referee: [Results on ranking divergence] Table/figure reporting ranking divergence (15.95% of configurations with ρ < 0.95) measures agreement between subset and full-benchmark rankings but does not test whether the selected prompts actually preserve the relative ordering that would arise from non-redundant prompts; this makes the claim that divergence is "concentrated at low thresholds" dependent on the same untested proxy.

Authors: We acknowledge that the divergence analysis inherits the same limitation. In the revised manuscript we will integrate the new correlation analysis described above with the existing divergence results. Specifically, we will examine whether configurations exhibiting low embedding-performance correlation are also those showing higher divergence rates, and we will qualify the statement that divergence is concentrated at low thresholds by referencing the strength of the proxy correlation in those regimes. This will provide a more direct link between the proxy validity and the observed divergence patterns. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical ranking-consistency measurements on input graphs

full rationale

The paper constructs similarity graphs from embedding distances at chosen percentile thresholds, then runs MIS solvers and measures Kendall's W on the resulting LLM rankings across random seeds. These W values and reduction percentages are direct empirical outputs; no equation defines a target quantity in terms of a fitted parameter that is then re-measured, and no self-citation supplies a load-bearing uniqueness theorem or ansatz. The embedding-to-redundancy premise is an unvalidated modeling assumption rather than a definitional loop. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The central claim rests on several configurable choices (percentile thresholds, embedding models, distance functions) that are exhaustively tested rather than derived; no new entities are postulated.

free parameters (3)

percentile threshold p
Six values tested; controls edge density in the similarity graph and directly affects subset size and ranking divergence.
embedding model
Six different models used to compute node features; choice affects which prompts are deemed similar.
distance measure
Three measures tested; alters the similarity graph construction.

axioms (2)

domain assumption Embedding proximity implies functional redundancy for LLM scoring
Invoked when edges are added above the threshold and when claiming the selected subset remains representative.
standard math MIS solvers produce valid independent sets on the constructed graphs
Standard property of the four cited algorithms (CPLEX, GREEDY, Online-MIS, ReduMIS).

pith-pipeline@v0.9.1-grok · 5802 in / 1431 out tokens · 25936 ms · 2026-06-28T16:54:20.296334+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

37 extracted references · 19 canonical work pages · 13 internal anchors

[1]

Anand, R., Aggarwal, D., and Kumar, V. (2017). A comparative analysis of optimization solvers. Journal of Statistics and Management Systems , 20(4):623--635

2017
[2]

Babakhin, Y., Osmulski, R., Ak, R., Moreira, G., Xu, M., Schifferer, B., Liu, B., and Oldridge, E. (2025). Llama-embed-nemotron-8b: A universal text embedding model for multilingual and cross-lingual tasks. arXiv preprint arXiv:2511.07025

work page arXiv 2025
[3]

Cenikj, G., Dieter Lang, R., Petrus Engelbrecht, A., Doerr, C., Korošec, P., and Eftimov, T. (2022). SELECTOR: Selecting a Representative Benchmark Suite for Reproducible Statistical Comparison . In Proc. of Genetic and Evolutionary Computation Conference (GECCO)

2022
[4]

Cenikj, G., Petelin, G., and Eftimov, T. (2024). Transoptas: Transformer-based algorithm selection for single-objective optimization. In Proceedings of the Genetic and Evolutionary Computation Conference Companion , GECCO '24 Companion, page 403–406, New York, NY, USA. Association for Computing Machinery

2024
[5]

Chen, J., Wang, C., Zhang, G., Ye, P., Bai, L., Hu, W., Qu, Y., and Hu, S. (2025). Learning compact representations of llm abilities via item response theory. arXiv preprint arXiv:2510.00844

work page arXiv 2025
[6]

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

Dahlum, J., Lamm, S., Sanders, P., Schulz, C., Strash, D., and Werneck, R. F. (2016). Accelerating local search for the maximum independent set problem. In 15th International Symposium on Experimental Algorithms SEA , volume 9685 of Lecture Notes in Computer Science , pages 118--133. Springer

2016
[8]

Eftimov, T., Petelin, G., Cenikj, G., Kostovska, A., Ispirova, G., Korošec, P., and Bogatinovski, J. (2022). Less is more: Selecting the right benchmarking set of data for time series classification. Expert Systems with Applications , 198:116871

2022
[9]

Gao, B., Song, F., Yang, Z., Cai, Z., Miao, Y., Dong, Q., Li, L., Ma, C., Chen, L., Tang, Z., et al. (2025). Omni-math: A universal olympiad level mathematic benchmark for large language models. In International Conference on Learning Representations , volume 2025, pages 100540--100569

2025
[10]

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., and Smith, N. A. (2018). Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages 107--112

2018
[12]

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. (2021). Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874

work page internal anchor Pith review Pith/arXiv arXiv 2021
[13]

Mistral 7B

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. (2023). Mistral 7b. arXiv preprint arXiv:2310.06825

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Kostovska, A., Cenikj, G., Vermetten, D., Jankovic, A., Nikolikj, A., Skvorc, U., Korosec, P., Doerr, C., and Eftimov, T. (2023). Ps-aas: Portfolio selection for automated algorithm selection in black-box optimization. In International Conference on Automated Machine Learning , pages 11--1. PMLR

2023
[15]

Lamm, S., Sanders, P., Schulz, C., Strash, D., and Werneck, R. F. (2017). Finding near-optimal independent sets at scale. J. Heuristics , 23(4):207--229

2017
[16]

Landis, J. R. and Koch, G. G. (1977). The measurement of observer agreement for categorical data. biometrics , pages 159--174

1977
[17]

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al. (2022). Holistic evaluation of language models. arXiv preprint arXiv:2211.09110

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. (2024). Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Mizrahi, M., Kaplan, G., Malkin, D., Dror, R., Shahaf, D., and Stanovsky, G. (2024). State of what art? a call for multi-prompt LLM evaluation. Transactions of the Association for Computational Linguistics , 12:933--949

2024
[20]

Perlitz, Y., Bandel, E., Gera, A., Arviv, O., Ein-Dor, L., Shnarch, E., Slonim, N., Shmueli-Scheuer, M., and Choshen, L. (2023). Efficient benchmarking of language models. arXiv preprint arXiv:2308.11696

work page arXiv 2023
[21]

tinybenchmarks: Evaluating LLMs with fewer examples

Polo, F. M., Weber, L., Choshen, L., Sun, Y., Xu, G., and Yurochkin, M. (2024). tinybenchmarks: evaluating llms with fewer examples. arXiv preprint arXiv:2402.14992

work page arXiv 2024
[22]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. (2023). Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Schilling-Wilhelmi, M., Alampara, N., and Jablonka, K. M. (2025). Lifting the benchmark iceberg with item-response theory. In AI for Accelerated Materials Design-ICLR 2025

2025
[24]

Solatorio, A. V. (2024). Gistembed: Guided in-sample selection of training negatives for text embedding fine-tuning. arXiv preprint arXiv:2402.16829

work page arXiv 2024
[25]

Gemini: A Family of Highly Capable Multimodal Models

Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. (2023). Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Truong, S., Tu, Y., Liang, P., Li, B., and Koyejo, S. (2025). Reliable and efficient amortized model-based evaluation. arXiv preprint arXiv:2503.13335

work page arXiv 2025
[27]

Vivek, R., Ethayarajh, K., Yang, D., and Kiela, D. (2024). Anchor points: Benchmarking models with much fewer examples. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 1576--1601

2024
[28]

Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., and Wei, F. (2022). Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., et al. (2024). Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems , 37:95266--95290

2024
[30]

White, C., Dooley, S., Roberts, M., Pal, A., Feuer, B., Jain, S., Shwartz-Ziv, R., Jain, N., Saifullah, K., Naidu, S., et al. (2024). Livebench: A challenging, contamination-free llm benchmark. arXiv preprint arXiv:2406.19314 , 4:2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Xiao, S., Liu, Z., Zhang, P., Muennighoff, N., Lian, D., and Nie, J.-Y. (2024). C-pack: Packed resources for general chinese embeddings. In Proceedings of the 47th international ACM SIGIR conference on research and development in information retrieval , pages 641--649

2024
[32]

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

E., and Hardt, M

Zhang, G., Dorner, F. E., and Hardt, M. (2026). How benchmark prediction from fewer data misses the mark. Advances in Neural Information Processing Systems , 38:135638--135678

2026
[34]

Zhang, Y., Li, M., Long, D., Zhang, X., Lin, H., Yang, B., Xie, P., Yang, A., Liu, D., Lin, J., et al. (2025). Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. (2023). Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems , 36:46595--46623

2023
[36]

Zhou, H., Huang, H., Zhao, Z., Han, L., Wang, H., Chen, K., Yang, M., Bao, W., Dong, J., Xu, B., et al. (2026). Lost in benchmarks? rethinking large language model benchmarking with item response theory. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 40, pages 35085--35093

2026
[37]

Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D., and Hou, L. (2023). Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] [1]

Anand, R., Aggarwal, D., and Kumar, V. (2017). A comparative analysis of optimization solvers. Journal of Statistics and Management Systems , 20(4):623--635

2017

[2] [2]

Babakhin, Y., Osmulski, R., Ak, R., Moreira, G., Xu, M., Schifferer, B., Liu, B., and Oldridge, E. (2025). Llama-embed-nemotron-8b: A universal text embedding model for multilingual and cross-lingual tasks. arXiv preprint arXiv:2511.07025

work page arXiv 2025

[3] [3]

Cenikj, G., Dieter Lang, R., Petrus Engelbrecht, A., Doerr, C., Korošec, P., and Eftimov, T. (2022). SELECTOR: Selecting a Representative Benchmark Suite for Reproducible Statistical Comparison . In Proc. of Genetic and Evolutionary Computation Conference (GECCO)

2022

[4] [4]

Cenikj, G., Petelin, G., and Eftimov, T. (2024). Transoptas: Transformer-based algorithm selection for single-objective optimization. In Proceedings of the Genetic and Evolutionary Computation Conference Companion , GECCO '24 Companion, page 403–406, New York, NY, USA. Association for Computing Machinery

2024

[5] [5]

Chen, J., Wang, C., Zhang, G., Ye, P., Bai, L., Hu, W., Qu, Y., and Hu, S. (2025). Learning compact representations of llm abilities via item response theory. arXiv preprint arXiv:2510.00844

work page arXiv 2025

[6] [6]

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021

[7] [7]

Dahlum, J., Lamm, S., Sanders, P., Schulz, C., Strash, D., and Werneck, R. F. (2016). Accelerating local search for the maximum independent set problem. In 15th International Symposium on Experimental Algorithms SEA , volume 9685 of Lecture Notes in Computer Science , pages 118--133. Springer

2016

[8] [8]

Eftimov, T., Petelin, G., Cenikj, G., Kostovska, A., Ispirova, G., Korošec, P., and Bogatinovski, J. (2022). Less is more: Selecting the right benchmarking set of data for time series classification. Expert Systems with Applications , 198:116871

2022

[9] [9]

Gao, B., Song, F., Yang, Z., Cai, Z., Miao, Y., Dong, Q., Li, L., Ma, C., Chen, L., Tang, Z., et al. (2025). Omni-math: A universal olympiad level mathematic benchmark for large language models. In International Conference on Learning Representations , volume 2025, pages 100540--100569

2025

[10] [10]

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., and Smith, N. A. (2018). Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages 107--112

2018

[12] [12]

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. (2021). Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874

work page internal anchor Pith review Pith/arXiv arXiv 2021

[13] [13]

Mistral 7B

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. (2023). Mistral 7b. arXiv preprint arXiv:2310.06825

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

Kostovska, A., Cenikj, G., Vermetten, D., Jankovic, A., Nikolikj, A., Skvorc, U., Korosec, P., Doerr, C., and Eftimov, T. (2023). Ps-aas: Portfolio selection for automated algorithm selection in black-box optimization. In International Conference on Automated Machine Learning , pages 11--1. PMLR

2023

[15] [15]

Lamm, S., Sanders, P., Schulz, C., Strash, D., and Werneck, R. F. (2017). Finding near-optimal independent sets at scale. J. Heuristics , 23(4):207--229

2017

[16] [16]

Landis, J. R. and Koch, G. G. (1977). The measurement of observer agreement for categorical data. biometrics , pages 159--174

1977

[17] [17]

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al. (2022). Holistic evaluation of language models. arXiv preprint arXiv:2211.09110

work page internal anchor Pith review Pith/arXiv arXiv 2022

[18] [18]

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. (2024). Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

Mizrahi, M., Kaplan, G., Malkin, D., Dror, R., Shahaf, D., and Stanovsky, G. (2024). State of what art? a call for multi-prompt LLM evaluation. Transactions of the Association for Computational Linguistics , 12:933--949

2024

[20] [20]

Perlitz, Y., Bandel, E., Gera, A., Arviv, O., Ein-Dor, L., Shnarch, E., Slonim, N., Shmueli-Scheuer, M., and Choshen, L. (2023). Efficient benchmarking of language models. arXiv preprint arXiv:2308.11696

work page arXiv 2023

[21] [21]

tinybenchmarks: Evaluating LLMs with fewer examples

Polo, F. M., Weber, L., Choshen, L., Sun, Y., Xu, G., and Yurochkin, M. (2024). tinybenchmarks: evaluating llms with fewer examples. arXiv preprint arXiv:2402.14992

work page arXiv 2024

[22] [22]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. (2023). Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [23]

Schilling-Wilhelmi, M., Alampara, N., and Jablonka, K. M. (2025). Lifting the benchmark iceberg with item-response theory. In AI for Accelerated Materials Design-ICLR 2025

2025

[24] [24]

Solatorio, A. V. (2024). Gistembed: Guided in-sample selection of training negatives for text embedding fine-tuning. arXiv preprint arXiv:2402.16829

work page arXiv 2024

[25] [25]

Gemini: A Family of Highly Capable Multimodal Models

Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. (2023). Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805

work page internal anchor Pith review Pith/arXiv arXiv 2023

[26] [26]

Truong, S., Tu, Y., Liang, P., Li, B., and Koyejo, S. (2025). Reliable and efficient amortized model-based evaluation. arXiv preprint arXiv:2503.13335

work page arXiv 2025

[27] [27]

Vivek, R., Ethayarajh, K., Yang, D., and Kiela, D. (2024). Anchor points: Benchmarking models with much fewer examples. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 1576--1601

2024

[28] [28]

Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., and Wei, F. (2022). Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533

work page internal anchor Pith review Pith/arXiv arXiv 2022

[29] [29]

Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., et al. (2024). Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems , 37:95266--95290

2024

[30] [30]

White, C., Dooley, S., Roberts, M., Pal, A., Feuer, B., Jain, S., Shwartz-Ziv, R., Jain, N., Saifullah, K., Naidu, S., et al. (2024). Livebench: A challenging, contamination-free llm benchmark. arXiv preprint arXiv:2406.19314 , 4:2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[31] [31]

Xiao, S., Liu, Z., Zhang, P., Muennighoff, N., Lian, D., and Nie, J.-Y. (2024). C-pack: Packed resources for general chinese embeddings. In Proceedings of the 47th international ACM SIGIR conference on research and development in information retrieval , pages 641--649

2024

[32] [32]

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [33]

E., and Hardt, M

Zhang, G., Dorner, F. E., and Hardt, M. (2026). How benchmark prediction from fewer data misses the mark. Advances in Neural Information Processing Systems , 38:135638--135678

2026

[34] [34]

Zhang, Y., Li, M., Long, D., Zhang, X., Lin, H., Yang, B., Xie, P., Yang, A., Liu, D., Lin, J., et al. (2025). Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. (2023). Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems , 36:46595--46623

2023

[36] [36]

Zhou, H., Huang, H., Zhao, Z., Han, L., Wang, H., Chen, K., Yang, M., Bao, W., Dong, J., Xu, B., et al. (2026). Lost in benchmarks? rethinking large language model benchmarking with item response theory. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 40, pages 35085--35093

2026

[37] [37]

Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D., and Hou, L. (2023). Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911

work page internal anchor Pith review Pith/arXiv arXiv 2023