pith. sign in

arxiv: 2606.01400 · v1 · pith:T2T5TCH4new · submitted 2026-05-31 · 💻 cs.CL · cs.AI

Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs

Pith reviewed 2026-06-28 16:54 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM benchmarksprompt selectionmaximum independent setsimilarity graphsevaluation efficiencyranking consistency
0
0 comments X

The pith

Maximum independent set selection on prompt similarity graphs produces consistent LLM benchmark rankings with fewer prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes modeling LLM benchmarks as graphs where prompts are nodes and edges connect those with high embedding similarity, then selecting a maximum independent set to remove redundant prompts. This approach is tested across multiple benchmarks, embedding models, and solvers to check if the reduced sets produce stable model rankings. The results show high consistency in rankings across different selections and only occasional divergence from the full benchmark. This matters because evaluating LLMs on full benchmarks is costly, so efficient subsets could save resources while maintaining reliability.

Core claim

Repeated application of maximum independent set algorithms to similarity graphs of benchmark prompts yields subsets that produce LLM rankings with Kendall's W at least 0.90 in 99.2% of cases, achieving 25-48% prompt reduction at higher thresholds, with ranking divergence from the full benchmark occurring in only 15.95% of configurations.

What carries the argument

Maximum Independent Set (MIS) algorithms applied to similarity graphs built from embedding-space distances between prompts, with edges for distances above a percentile threshold.

If this is right

  • LLM rankings remain stable across different random seeds in stochastic MIS solvers.
  • Subsets can reduce the number of prompts needed by 25 to 48 percent on average at higher thresholds.
  • Most configurations maintain high correlation (rho >= 0.95) with full benchmark rankings.
  • Failure modes are concentrated in dense graphs at low thresholds on certain benchmarks like GPQA and IFEval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar graph-based reduction could be applied to other evaluation tasks beyond LLMs if embedding similarity correlates with performance correlation.
  • Choosing the right percentile threshold is key to balancing reduction and fidelity, suggesting adaptive methods might improve results.
  • Future work could explore dynamic thresholds or hybrid selection methods to minimize the 15.95% divergence cases.

Load-bearing premise

Embedding-space distance above a percentile threshold indicates that two prompts will produce correlated model behaviors across LLMs, making them functionally redundant.

What would settle it

An experiment showing that for a selected subset, the ranking of LLMs differs substantially from the full benchmark in a way not explained by the reported 15.95% divergence rate, or that Kendall's W falls below 0.90 in many configurations.

Figures

Figures reproduced from arXiv: 2606.01400 by Ana Gjorgjevikj, Denica Kjorvezir, Gjorgjina Cenikj, Marko Djukanovi\'c, Tome Eftimov.

Figure 1
Figure 1. Figure 1: The MIS-based prompt selection pipeline. Prompts are encoded, a similarity graph is con [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Embedding space coverage illustration. (a) Full benchmark: Domain A is overrepresented, [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Kendall’s𝑊 (seed concordance) vs. Spearman 𝜌 (agreement with full baseline) for all stochas￾tic configurations, coloured by embedding model. Reference lines at 𝑊 = 0.90, 𝜌 = 0.95 define four quadrants. Most points fall in the upper region (𝑊 ≥ 0.99 in 99.2% of configs), with substantial spread along the 𝑥-axis (varying 𝜌), confirming the hypothesis. (BGE, E5, GIST; 75.7–80.4%; [PITH_FULL_IMAGE:figures/ful… view at source ↗
Figure 4
Figure 4. Figure 4: Agreement with full baseline (𝜌, left) and seed consistency (𝑊 , right) as a function of the fraction of prompts selected, by dataset. Trend lines fitted per benchmark and stochastic optimization algorithm. MMLU-Pro and Omni-MATH achieve high quality even with small fractions (p10, p20, etc); IFEval requires larger subsets for comparable quality. distinction, where the MIS selection focuses on a restricted… view at source ↗
Figure 5
Figure 5. Figure 5: Distance distributions for dataset GPQA_CHUNKED and different models with selected met [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Distance distributions for dataset GPQA_NOCHUNK and different models with selected [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Distance distributions for dataset GPQA and different (small) models with selected metrics [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Rank shift heatmap for Online-MIS on IFEval (Qwen4b, cosine, p20). Each cell: deviation of [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗
read the original abstract

Evaluating large language models (LLMs) across comprehensive benchmarks is expensive and time-consuming. We propose a graph-based prompt selection framework that models each benchmark as a similarity graph -- nodes are prompts connected if their embedding-space distance falls above a configurable threshold -- and applies Maximum Independent Set (MIS) algorithms to select a maximally diverse, non-redundant subset. We evaluate four MIS solvers (CPLEX, GREEDY, Online-MIS, ReduMIS) across six embedding models, three distance measures, six percentile thresholds, and four benchmarks (GPQA, IFEval, MMLU-Pro, Omni-MATH) covering 66 LLMs. Our central hypothesis -- that repeated selection under different random seeds yields consistent LLM rankings that may also differ from the full-benchmark baseline -- is strongly confirmed: Kendall's $W \geq 0.90$ in 99.2\% of stochastic configurations (mean $W = 0.997 \pm 0.008$), while at higher percentile thresholds selected subsets achieve 25--48\% prompt reduction on average. Ranking divergence from the full benchmark ($\rho < 0.95$) occurs in only 15.95\% of configurations, concentrated at low thresholds ($p_{10}$--$p_{20}$) and benchmarks (GPQA, IFEval), identifying overly dense graphs as the primary failure mode.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes modeling LLM benchmarks as similarity graphs where prompts are nodes connected if their embedding distances exceed a percentile threshold, then using MIS solvers (CPLEX, GREEDY, Online-MIS, ReduMIS) to extract diverse subsets. Across six embedding models, three distance measures, six thresholds, and four benchmarks (GPQA, IFEval, MMLU-Pro, Omni-MATH) with 66 LLMs, it reports that stochastic MIS selections produce highly consistent LLM rankings (Kendall's W ≥ 0.90 in 99.2% of configurations, mean W = 0.997 ± 0.008) while achieving 25–48% prompt reduction at higher thresholds, with ranking divergence from the full benchmark (ρ < 0.95) occurring in only 15.95% of cases, mainly at low thresholds.

Significance. If the embedding-distance proxy for functional redundancy holds, the method offers a practical route to cheaper yet reliable LLM evaluation with explicit failure modes identified. The work earns credit for its large-scale, multi-factor experimental design (four solvers × six embeddings × three distances × six thresholds × four benchmarks) and direct reporting of both consistency metrics and divergence rates, which supports reproducibility of the core empirical claims.

major comments (2)
  1. [Similarity graph construction and hypothesis statement (abstract and §3)] The central hypothesis and all reported consistency/reduction results rest on the unvalidated premise that embedding-space distance above a percentile threshold indicates functional redundancy (i.e., correlated LLM behavior across the 66 models). No analysis is provided that correlates embedding distances with pairwise Spearman correlations of per-LLM accuracy vectors, leaving open the possibility that high Kendall's W reflects stability of an arbitrary selection process rather than preservation of evaluation signal.
  2. [Results on ranking divergence] Table/figure reporting ranking divergence (15.95% of configurations with ρ < 0.95) measures agreement between subset and full-benchmark rankings but does not test whether the selected prompts actually preserve the relative ordering that would arise from non-redundant prompts; this makes the claim that divergence is "concentrated at low thresholds" dependent on the same untested proxy.
minor comments (2)
  1. [Experimental protocol] The exact procedure for handling ties or disconnected components in the MIS solvers should be stated explicitly, as it affects reproducibility of the 99.2% consistency figure.
  2. [Methodology] Clarify whether the six percentile thresholds are computed per-benchmark or globally, and report the resulting graph densities for each combination.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify a key area for strengthening the validation of our central hypothesis. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [Similarity graph construction and hypothesis statement (abstract and §3)] The central hypothesis and all reported consistency/reduction results rest on the unvalidated premise that embedding-space distance above a percentile threshold indicates functional redundancy (i.e., correlated LLM behavior across the 66 models). No analysis is provided that correlates embedding distances with pairwise Spearman correlations of per-LLM accuracy vectors, leaving open the possibility that high Kendall's W reflects stability of an arbitrary selection process rather than preservation of evaluation signal.

    Authors: We agree that the manuscript would be strengthened by an explicit validation of the embedding-distance proxy. In the revision we will add a new analysis that computes, for each benchmark, the Spearman correlation between pairwise embedding distances and the corresponding pairwise Spearman correlations of the per-LLM accuracy vectors across the 66 models. We will report these correlations (stratified by embedding model, distance measure, and threshold) and discuss their relationship to the observed Kendall's W values, thereby testing whether the graph construction captures functional redundancy rather than merely producing stable but arbitrary subsets. revision: yes

  2. Referee: [Results on ranking divergence] Table/figure reporting ranking divergence (15.95% of configurations with ρ < 0.95) measures agreement between subset and full-benchmark rankings but does not test whether the selected prompts actually preserve the relative ordering that would arise from non-redundant prompts; this makes the claim that divergence is "concentrated at low thresholds" dependent on the same untested proxy.

    Authors: We acknowledge that the divergence analysis inherits the same limitation. In the revised manuscript we will integrate the new correlation analysis described above with the existing divergence results. Specifically, we will examine whether configurations exhibiting low embedding-performance correlation are also those showing higher divergence rates, and we will qualify the statement that divergence is concentrated at low thresholds by referencing the strength of the proxy correlation in those regimes. This will provide a more direct link between the proxy validity and the observed divergence patterns. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical ranking-consistency measurements on input graphs

full rationale

The paper constructs similarity graphs from embedding distances at chosen percentile thresholds, then runs MIS solvers and measures Kendall's W on the resulting LLM rankings across random seeds. These W values and reduction percentages are direct empirical outputs; no equation defines a target quantity in terms of a fitted parameter that is then re-measured, and no self-citation supplies a load-bearing uniqueness theorem or ansatz. The embedding-to-redundancy premise is an unvalidated modeling assumption rather than a definitional loop. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The central claim rests on several configurable choices (percentile thresholds, embedding models, distance functions) that are exhaustively tested rather than derived; no new entities are postulated.

free parameters (3)
  • percentile threshold p
    Six values tested; controls edge density in the similarity graph and directly affects subset size and ranking divergence.
  • embedding model
    Six different models used to compute node features; choice affects which prompts are deemed similar.
  • distance measure
    Three measures tested; alters the similarity graph construction.
axioms (2)
  • domain assumption Embedding proximity implies functional redundancy for LLM scoring
    Invoked when edges are added above the threshold and when claiming the selected subset remains representative.
  • standard math MIS solvers produce valid independent sets on the constructed graphs
    Standard property of the four cited algorithms (CPLEX, GREEDY, Online-MIS, ReduMIS).

pith-pipeline@v0.9.1-grok · 5802 in / 1431 out tokens · 25936 ms · 2026-06-28T16:54:20.296334+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 19 canonical work pages · 13 internal anchors

  1. [1]

    Anand, R., Aggarwal, D., and Kumar, V. (2017). A comparative analysis of optimization solvers. Journal of Statistics and Management Systems , 20(4):623--635

  2. [2]

    Babakhin, Y., Osmulski, R., Ak, R., Moreira, G., Xu, M., Schifferer, B., Liu, B., and Oldridge, E. (2025). Llama-embed-nemotron-8b: A universal text embedding model for multilingual and cross-lingual tasks. arXiv preprint arXiv:2511.07025

  3. [3]

    Cenikj, G., Dieter Lang, R., Petrus Engelbrecht, A., Doerr, C., Korošec, P., and Eftimov, T. (2022). SELECTOR: Selecting a Representative Benchmark Suite for Reproducible Statistical Comparison . In Proc. of Genetic and Evolutionary Computation Conference (GECCO)

  4. [4]

    Cenikj, G., Petelin, G., and Eftimov, T. (2024). Transoptas: Transformer-based algorithm selection for single-objective optimization. In Proceedings of the Genetic and Evolutionary Computation Conference Companion , GECCO '24 Companion, page 403–406, New York, NY, USA. Association for Computing Machinery

  5. [5]

    Chen, J., Wang, C., Zhang, G., Ye, P., Bai, L., Hu, W., Qu, Y., and Hu, S. (2025). Learning compact representations of llm abilities via item response theory. arXiv preprint arXiv:2510.00844

  6. [6]

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

  7. [7]

    Dahlum, J., Lamm, S., Sanders, P., Schulz, C., Strash, D., and Werneck, R. F. (2016). Accelerating local search for the maximum independent set problem. In 15th International Symposium on Experimental Algorithms SEA , volume 9685 of Lecture Notes in Computer Science , pages 118--133. Springer

  8. [8]

    Eftimov, T., Petelin, G., Cenikj, G., Kostovska, A., Ispirova, G., Korošec, P., and Bogatinovski, J. (2022). Less is more: Selecting the right benchmarking set of data for time series classification. Expert Systems with Applications , 198:116871

  9. [9]

    Gao, B., Song, F., Yang, Z., Cai, Z., Miao, Y., Dong, Q., Li, L., Ma, C., Chen, L., Tang, Z., et al. (2025). Omni-math: A universal olympiad level mathematic benchmark for large language models. In International Conference on Learning Representations , volume 2025, pages 100540--100569

  10. [10]

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783

  11. [11]

    Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., and Smith, N. A. (2018). Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages 107--112

  12. [12]

    Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. (2021). Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874

  13. [13]

    Mistral 7B

    Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. (2023). Mistral 7b. arXiv preprint arXiv:2310.06825

  14. [14]

    Kostovska, A., Cenikj, G., Vermetten, D., Jankovic, A., Nikolikj, A., Skvorc, U., Korosec, P., Doerr, C., and Eftimov, T. (2023). Ps-aas: Portfolio selection for automated algorithm selection in black-box optimization. In International Conference on Automated Machine Learning , pages 11--1. PMLR

  15. [15]

    Lamm, S., Sanders, P., Schulz, C., Strash, D., and Werneck, R. F. (2017). Finding near-optimal independent sets at scale. J. Heuristics , 23(4):207--229

  16. [16]

    Landis, J. R. and Koch, G. G. (1977). The measurement of observer agreement for categorical data. biometrics , pages 159--174

  17. [17]

    Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al. (2022). Holistic evaluation of language models. arXiv preprint arXiv:2211.09110

  18. [18]

    Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. (2024). Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437

  19. [19]

    Mizrahi, M., Kaplan, G., Malkin, D., Dror, R., Shahaf, D., and Stanovsky, G. (2024). State of what art? a call for multi-prompt LLM evaluation. Transactions of the Association for Computational Linguistics , 12:933--949

  20. [20]

    Perlitz, Y., Bandel, E., Gera, A., Arviv, O., Ein-Dor, L., Shnarch, E., Slonim, N., Shmueli-Scheuer, M., and Choshen, L. (2023). Efficient benchmarking of language models. arXiv preprint arXiv:2308.11696

  21. [21]

    tinybenchmarks: Evaluating LLMs with fewer examples

    Polo, F. M., Weber, L., Choshen, L., Sun, Y., Xu, G., and Yurochkin, M. (2024). tinybenchmarks: evaluating llms with fewer examples. arXiv preprint arXiv:2402.14992

  22. [22]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. (2023). Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022

  23. [23]

    Schilling-Wilhelmi, M., Alampara, N., and Jablonka, K. M. (2025). Lifting the benchmark iceberg with item-response theory. In AI for Accelerated Materials Design-ICLR 2025

  24. [24]

    Solatorio, A. V. (2024). Gistembed: Guided in-sample selection of training negatives for text embedding fine-tuning. arXiv preprint arXiv:2402.16829

  25. [25]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. (2023). Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805

  26. [26]

    Truong, S., Tu, Y., Liang, P., Li, B., and Koyejo, S. (2025). Reliable and efficient amortized model-based evaluation. arXiv preprint arXiv:2503.13335

  27. [27]

    Vivek, R., Ethayarajh, K., Yang, D., and Kiela, D. (2024). Anchor points: Benchmarking models with much fewer examples. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 1576--1601

  28. [28]

    Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., and Wei, F. (2022). Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533

  29. [29]

    Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., et al. (2024). Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems , 37:95266--95290

  30. [30]

    White, C., Dooley, S., Roberts, M., Pal, A., Feuer, B., Jain, S., Shwartz-Ziv, R., Jain, N., Saifullah, K., Naidu, S., et al. (2024). Livebench: A challenging, contamination-free llm benchmark. arXiv preprint arXiv:2406.19314 , 4:2

  31. [31]

    Xiao, S., Liu, Z., Zhang, P., Muennighoff, N., Lian, D., and Nie, J.-Y. (2024). C-pack: Packed resources for general chinese embeddings. In Proceedings of the 47th international ACM SIGIR conference on research and development in information retrieval , pages 641--649

  32. [32]

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388

  33. [33]

    E., and Hardt, M

    Zhang, G., Dorner, F. E., and Hardt, M. (2026). How benchmark prediction from fewer data misses the mark. Advances in Neural Information Processing Systems , 38:135638--135678

  34. [34]

    Zhang, Y., Li, M., Long, D., Zhang, X., Lin, H., Yang, B., Xie, P., Yang, A., Liu, D., Lin, J., et al. (2025). Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176

  35. [35]

    Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. (2023). Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems , 36:46595--46623

  36. [36]

    Zhou, H., Huang, H., Zhao, Z., Han, L., Wang, H., Chen, K., Yang, M., Bao, W., Dong, J., Xu, B., et al. (2026). Lost in benchmarks? rethinking large language model benchmarking with item response theory. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 40, pages 35085--35093

  37. [37]

    Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D., and Hou, L. (2023). Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911