pith. sign in

arxiv: 2606.24020 · v1 · pith:ITL63L75new · submitted 2026-06-22 · 💻 cs.LG

You Don't Need to Run Every Eval

Pith reviewed 2026-06-26 08:21 UTC · model grok-4.3

classification 💻 cs.LG
keywords benchmark evaluationmatrix completionmodel performance predictionlow-rank structurefrontier modelsevaluation efficiencyscore matrix
0
0 comments X

The pith

Model scores across 133 benchmarks collapse to a rank-2 structure that two numbers largely determine.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Frontier models publish scores on many benchmarks, yet these scores turn out to be highly redundant. A matrix of 84 models on 133 benchmarks is approximately rank-2, so two latent factors recover most of the observed variation and predict held-out entries. BenchPress implements this insight with a logit-space matrix completion procedure that adds a confidence layer for each prediction. The same structure identifies compact subsets of five benchmarks whose results suffice to reconstruct the rest of a model's public scorecard. The finding directly reduces the number of evaluations needed to compare or release new models.

Core claim

The compiled score matrix of 84 frontier models on 133 benchmarks is approximately rank-2. Two factors recover hidden scores to within 4.6 points on average and already explain over 90 percent of the shared variation. BenchPress performs logit-space rank-2 matrix completion together with a per-prediction confidence layer. Subsets of five benchmarks recover the remaining scores to within 3.93 points on average, and a cheaper five-benchmark set reaches 4.55 points.

What carries the argument

BenchPress, a logit-space rank-2 matrix completion method that recovers held-out benchmark scores with a built-in confidence layer.

If this is right

  • Five benchmarks (GPQA-D, HLE, Codeforces, MMLU-Pro, ARC-AGI-1) recover the rest of a model's scores to within 3.93 points.
  • A cheaper five-benchmark set (GPQA-D, MMLU-Pro, Aider Polyglot, MATH-500, AIME 2026) recovers scores to within 4.55 points.
  • Two factors already explain over 90 percent of the variation among models on the benchmarks they share.
  • The method supplies a per-prediction confidence layer that indicates when each recovered score can be trusted.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Many current benchmarks appear to measure overlapping capabilities rather than independent ones.
  • If the low-rank pattern persists, organizations could release models after running only a small fixed set of evaluations.
  • Newly proposed benchmarks could be tested for consistency with the existing two-factor model before adoption.

Load-bearing premise

The rank-2 structure seen in the current collection of models and benchmarks will continue to hold for future models and future benchmarks.

What would settle it

A new model whose scores across the 133 benchmarks require three or more factors to reach the same recovery accuracy as the two-factor model.

Figures

Figures reproduced from arXiv: 2606.24020 by Dimitris Papailiopoulos, Yuchen Zeng.

Figure 1
Figure 1. Figure 1: BENCHPRESS predicts unseen benchmark scores from a handful of revealed ones. Left: For four model– benchmark cells, we hide the target score and reveal k other scores from the same model’s row, in a random order. The y-axis is absolute prediction error on the held-out target cell. Error drops sharply once a few same-row scores are revealed, and reaches zero whenever the target cell itself appears in the re… view at source ↗
Figure 2
Figure 2. Figure 2: Observation pattern of the 84 × 133 score matrix, sorted by coverage. Each row is a model and each column is a bench￾mark; dark cells are observed scores. Only 23.3% of entries are filled. Data quality caveat. Public benchmark scores are heterogeneous measurements, and the matrix treats all of them as comparable. Every score has a source URL for auditability, but users should be aware of the following nois… view at source ↗
Figure 3
Figure 3. Figure 3: Composition and coverage of the adopted score matrix (84 models × 133 benchmarks, 2,604 observed cells, 23.3% fill). The matrix spans a broad mix of benchmark categories (a), with most observed cells in math, coding, agentic/tool-use, and knowledge-oriented evaluations (b). Models are concentrated in recent releases (c), and coverage varies by release time because newer models are often reported on differe… view at source ↗
Figure 4
Figure 4. Figure 4: Held-out MedAPE vs. rank for raw- and logit-space Soft-Impute matrix completion. Both curves bottom out at rank 2 [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: MedAPE during probe￾set construction. Pooled MedAPE decreases as selected benchmark scores are revealed; every budget is evaluated on the same observed cells. 0 5 10 Top benchmarks 0% 30% 60% 90% Pairwise accuracy Random Predictive Low-cost [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 8
Figure 8. Figure 8: Hypothesis MedAPE ↓ (p-value) MedAE ↓ (p-value) H1 Low-rank fit p = 0.150 p = 0.036 H2 Score level p < 0.001 p = 0.054 H3 Score spread p < 0.001 p < 0.001 H4 Target coverage p < 0.001 p < 0.001 H5 Strong-neighbor presence p < 0.001 p < 0.001 H6 Strong-neighbor support p = 0.209 p = 0.035 H7 Same-category evidence p = 0.832 p = 0.725 10 1 10 1 10 3 Score std. dev. (log) 10 0 10 1 10 2 MedAPE (log %) H3: Wid… view at source ↗
Figure 9
Figure 9. Figure 9: Representative model-level prediction-error patterns. Five model-side factors jointly affect how hard a model is to predict: reasoning models are easier than non-reasoning ones (H2), higher-scoring models are easier than lower-scoring ones (H3), having at least one strongly correlated peer model in the training matrix makes prediction easier (H5), more observed benchmark scores on the target model makes pr… view at source ↗
Figure 10
Figure 10. Figure 10: Reliability estimates identify safer predictions. Lower curves mean safer subsets are identified more reliably; the hybrid estimator gives the cleanest ordering. Point predictions alone are not enough for deciding when to skip bench￾mark runs. If a predicted score will be used to decide whether to skip an expensive evaluation, the useful question is not only “what score would this model get?”, but also “h… view at source ↗
Figure 11
Figure 11. Figure 11: Full transform × method grid (7 transforms, 12 methods) across the Section 4 metrics: MedAPE, MedAE, and coverage. Each cell reports the best hyperparameter configuration for that pair, evaluated as the median over 10 seeds × 3 folds. All methods operate in standardized space. Green = better. 30 [PITH_FULL_IMAGE:figures/full_fig_p030_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: All seven active benchmark-level hypotheses against both score-error metrics. The left block shows H1–H3 (correlational hypotheses) and the right block shows H4–H7 (ablations). Columns within each block are MedAPE (↓) and MedAE (↓). Correlational rows show scatter + binned trend; ablation rows show line plots across drop fractions (H4, H6) or paired bars at the headline intervention (H5 with |r| ≥0.85 pee… view at source ↗
Figure 13
Figure 13. Figure 13: Per-benchmark predictability. For each model, half of observed scores are held out and predicted via BENCHPRESS; errors are aggregated by benchmark column (10 seeds). Benchmarks below the 15% MedAPE threshold (dashed red line) are well-predicted by others. Color = benchmark category. 40 [PITH_FULL_IMAGE:figures/full_fig_p040_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Selected settings for all nine model-level hypotheses against both score-error metrics. The left block shows H1–H5 and the right block shows H6–H9. Columns within each block are MedAPE (↓) and MedAE (↓). H1–H4 are correlational rows (H2 grouped bars), H5–H8 are ablations, and H9 is temporal. Ablation rows show paired bars at the headline intervention (H5: |r| ≥ 0.95 peers; H6: 75% strongest-peer overlap m… view at source ↗
Figure 15
Figure 15. Figure 15: Per-model predictability. For each model, half of observed scores are held out and predicted via BENCHPRESS; errors are aggregated by model row (10 seeds). Models below the 15% MedAPE threshold (dashed red line) are well-predicted by others. Color = provider. 42 [PITH_FULL_IMAGE:figures/full_fig_p042_15.png] view at source ↗
read the original abstract

A modern model release reports scores on 40+ benchmarks and the same evaluations were run many more times before it: to track training progress, compare design choices, and select the checkpoint for the release. But do we need to run every eval? We compile a public score matrix of 84 frontier models on 133 benchmarks (2,604 cells, 23.3% filled) and find it is approximately rank-2: a model's scores across all 133 benchmarks are largely determined by just two numbers. We confirm this in two ways: scores hidden from the matrix are best recovered using two factors, and two factors already explain over 90% of the variation among models on the benchmarks they share. Building on this, we design BenchPress: a logit-space rank-2 matrix completion method that recovers held-out scores to within 4.6 points, and a confidence layer that says when each prediction can be trusted. Using BenchPress, we find a subset of five benchmarks {GPQA-D, HLE, Codeforces, MMLU-Pro, ARC-AGI-1} that can recover the rest of a model's public scorecard to within 3.93 points. For a tighter inference budget, a cheaper set {GPQA-D, MMLU-Pro, Aider Polyglot, MATH-500, AIME 2026} can predict a model's evals to within 4.55. We release the score matrix, the BenchPress code, and an interactive tool that predicts any model's score on any benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript compiles a sparse score matrix of 84 frontier models on 133 benchmarks (2,604 cells, 23.3% filled) and claims it is approximately rank-2, with a model's scores largely determined by two latent factors. This is supported by two confirmations: held-out scores are best recovered with two factors, and two factors explain over 90% of variance on shared benchmarks. BenchPress, a logit-space rank-2 matrix completion method with a confidence layer, recovers held-out scores to within 4.6 points. Small benchmark subsets (e.g., {GPQA-D, HLE, Codeforces, MMLU-Pro, ARC-AGI-1}) recover the full scorecard to within 3.93 points (or 4.55 for a cheaper set), and the matrix, code, and interactive tool are released.

Significance. If the rank-2 structure generalizes, the work could meaningfully reduce redundant evaluations in ML by enabling accurate prediction of many benchmark scores from a small subset, lowering compute costs for model development and comparison. The quantitative held-out recovery results, variance explanation, and open release of the score matrix, BenchPress code, and tool are concrete strengths that support reproducibility and further use.

major comments (1)
  1. [Abstract] Abstract and main results on generalization: the central claim that subsets suffice to recover a model's full public scorecard (and thus that not every eval needs to be run) depends on the rank-2 structure persisting for models and benchmarks outside the collected 84×133 matrix. Held-out recovery within the matrix (4.6 points) and the >90% variance claim provide internal checks, but no external validation on new models or new benchmarks is reported, leaving the generalization premise untested and load-bearing for the practical conclusions.
minor comments (1)
  1. [Abstract] The abstract reports recovery errors (4.6, 3.93) without reference to the typical score ranges or normalization of the benchmarks, which would help readers assess practical significance.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and the clear identification of the generalization issue. We address the single major comment below and agree that a revision is warranted to clarify the scope of the claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract and main results on generalization: the central claim that subsets suffice to recover a model's full public scorecard (and thus that not every eval needs to be run) depends on the rank-2 structure persisting for models and benchmarks outside the collected 84×133 matrix. Held-out recovery within the matrix (4.6 points) and the >90% variance claim provide internal checks, but no external validation on new models or new benchmarks is reported, leaving the generalization premise untested and load-bearing for the practical conclusions.

    Authors: We agree this is a substantive limitation. All quantitative results (held-out recovery to 4.6 points, >90% variance explained, and the 3.93/4.55-point subset recoveries) are internal to the 84×133 matrix; no experiments on models or benchmarks outside this collection are presented. The abstract and introduction frame the work as evidence that the observed matrix is approximately rank-2 and that small subsets suffice to recover the remainder of that matrix, but the broader practical claim that “not every eval needs to be run” implicitly relies on the structure generalizing. We will revise the abstract to emphasize that the findings apply to the collected data, add an explicit Limitations paragraph stating that external validation on new models and benchmarks remains untested, and note this as an important direction for follow-up work. These changes will make the claims precise without overstating the evidence. revision: yes

Circularity Check

0 steps flagged

No circularity; rank-2 observation and held-out recovery are empirically independent

full rationale

The paper compiles an observed 84×133 score matrix (23.3% dense) and reports that it is approximately rank-2, confirmed by two internal checks: (1) held-out cells are recovered best with two factors, and (2) two factors explain >90% variance on shared benchmarks. BenchPress is a rank-2 matrix-completion procedure whose performance is measured on those same held-out entries. Subset selection (e.g., the five-benchmark set) is likewise evaluated by recovery error on the remaining observed cells. None of these steps reduces by construction to the inputs; the held-out recovery constitutes a genuine cross-validation test within the collected data. No self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. Generalization beyond the observed matrix is an untested modeling assumption, but that is a question of external validity, not circularity. The derivation chain is therefore self-contained against the internal benchmarks it reports.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central modeling choice is the low-rank assumption; the rank value itself is selected from data.

free parameters (1)
  • rank = 2
    Selected because two factors explain over 90% of variation on shared benchmarks.
axioms (1)
  • domain assumption Benchmark scores can be modeled as a rank-2 matrix in logit space.
    Core assumption enabling BenchPress matrix completion and subset selection.

pith-pipeline@v0.9.1-grok · 5804 in / 1299 out tokens · 59825 ms · 2026-06-26T08:21:09.496596+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

98 extracted references · 25 canonical work pages · 17 internal anchors

  1. [1]

    Foundations of Computational Mathematics , volume=

    Exact Matrix Completion via Convex Optimization , author=. Foundations of Computational Mathematics , volume=

  2. [2]

    Journal of Machine Learning Research , volume=

    Spectral Regularization Algorithms for Learning Large Incomplete Matrices , author=. Journal of Machine Learning Research , volume=

  3. [3]

    Computer , volume=

    Matrix Factorization Techniques for Recommender Systems , author=. Computer , volume=. 2009 , publisher=

  4. [4]

    International Conference on Machine Learning , year=

    tinyBenchmarks: evaluating LLMs with fewer examples , author=. International Conference on Machine Learning , year=

  5. [5]

    Findings of EMNLP , year=

    How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench , author=. Findings of EMNLP , year=

  6. [6]

    arXiv preprint arXiv:2506.07673 , year=

    How Benchmark Prediction from Fewer Data Misses the Mark , author=. arXiv preprint arXiv:2506.07673 , year=

  7. [7]

    arXiv preprint arXiv:2410.02223 , year=

    EmbedLLM: Learning Compact Representations of Large Language Models , author=. arXiv preprint arXiv:2410.02223 , year=

  8. [8]

    Intelligence , volume=

    Evidence of Interrelated Cognitive-Like Capabilities in Large Language Models: Indications of Artificial General Intelligence or Achievement? , author=. Intelligence , volume=. 2024 , doi=

  9. [9]

    2026 , eprint=

    Rubinstein, Alexander and Raible, Benjamin and Gubri, Martin and Oh, Seong Joon , booktitle=. 2026 , eprint=

  10. [10]

    2025 , howpublished=

    Benchmark Scores = General Capability + Claudiness , author=. 2025 , howpublished=

  11. [11]

    arXiv preprint arXiv:2512.00193 , year=

    A Rosetta Stone for AI Benchmarks , author=. arXiv preprint arXiv:2512.00193 , year=

  12. [12]

    Rethink Reporting of Evaluation Results in

    Burnell, Ryan and others , journal=. Rethink Reporting of Evaluation Results in

  13. [13]

    NeurIPS , year=

    Observational Scaling Laws and the Predictability of Language Model Performance , author=. NeurIPS , year=

  14. [14]

    EMNLP , year=

    Collaborative Performance Prediction for Large Language Models , author=. EMNLP , year=

  15. [15]

    2025 , eprint=

    Kipnis, Alexander and others , booktitle=. 2025 , eprint=

  16. [16]

    Sloth: Scaling Laws for

    Polo, Felipe Maia and Somerstep, Seamus and Choshen, Leshem and Sun, Yuekai and Yurochkin, Mikhail , booktitle=. Sloth: Scaling Laws for. 2025 , eprint=

  17. [17]

    and Stoica, Ion , booktitle=

    Chiang, Wei-Lin and Zheng, Lianmin and Sheng, Ying and Angelopoulos, Anastasios Nikolas and Li, Tianle and Li, Dacheng and Zhu, Banghua and Zhang, Hao and Jordan, Michael and Gonzalez, Joseph E. and Stoica, Ion , booktitle=. Chatbot Arena: An Open Platform for Evaluating. 2024 , eprint=

  18. [18]

    2026 , howpublished=

  19. [19]

    Liu, Zirui and others , booktitle=. am-. 2025 , note=

  20. [20]

    Performance Prediction via

    Schram, Viktoria and Beck, Daniel and Cohn, Trevor , booktitle=. Performance Prediction via

  21. [21]

    Proceedings of NAACL , pages=

    Efficient Benchmarking (of Language Models) , author=. Proceedings of NAACL , pages=

  22. [22]

    Benchmark Agreement Testing Done Right: A Guide for

    Perlitz, Yotam and Gera, Ariel and Arviv, Ofir and Yehudai, Asaf and Bandel, Elron and Shnarch, Eyal and Shmueli-Scheuer, Michal and Choshen, Leshem , journal=. Benchmark Agreement Testing Done Right: A Guide for

  23. [23]

    What Are the Best Systems? New Perspectives on

    Colombo, Pierre and Noiry, Nathan and Irurozki, Ekhine and Clemencon, Stephan , booktitle=. What Are the Best Systems? New Perspectives on

  24. [24]

    ICLR , year=

    On Speeding Up Language Model Evaluation , author=. ICLR , year=. 2407.06172 , archivePrefix=

  25. [25]

    arXiv preprint arXiv:2306.10062 , year=

    Revealing the Structure of Language Model Capabilities , author=. arXiv preprint arXiv:2306.10062 , year=

  26. [26]

    Ni, Jinjie and Xue, Fuzhao and Yue, Xiang and Deng, Yuntian and Shah, Mahir and Jain, Kabir and Neubig, Graham and You, Yang , booktitle=

  27. [27]

    Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings

    Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings , author=. arXiv preprint arXiv:2510.26384 , year=

  28. [28]

    Findings of ACL , year=

    Pacchiardi, Lorenzo and Voudouris, Konstantinos and Slater, Ben and Mart. Findings of ACL , year=

  29. [29]

    Look Before You Leap: Estimating

    Park, Jungsoo and Mendes, Ethan and Stanovsky, Gabriel and Ritter, Alan , journal=. Look Before You Leap: Estimating

  30. [30]

    Predicting

    Koh, Woosung and Suk, Juyoung and Han, Sungjun and Yun, Se-Young and Shin, Jamin , booktitle=. Predicting. 2026 , eprint=

  31. [31]

    Proceedings of EACL , year=

    Anchor Points: Benchmarking Models with Much Fewer Examples , author=. Proceedings of EACL , year=

  32. [32]

    Saranathan, Gayathri and Xu, Cong and Alam, Mahammad Parwez and Kumar, Tarun and Foltin, Martin and Wong, Soon Yee and Bhattacharya, Suparna , booktitle=

  33. [33]

    Rethinking

    Wang, Shaobo and others , booktitle=. Rethinking. 2026 , eprint=

  34. [34]

    Journal of the Royal Statistical Society: Series B , volume=

    Estimating the number of clusters in a data set via the gap statistic , author=. Journal of the Royal Statistical Society: Series B , volume=. 2001 , publisher=

  35. [35]

    2026 , howpublished=

    Qwen3.5-397B-A17B , author=. 2026 , howpublished=

  36. [36]

    2026 , howpublished=

    Kimi K2.5 , author=. 2026 , howpublished=

  37. [37]

    2024 , howpublished=

  38. [38]

    2025 , howpublished=

  39. [39]

    Let's Verify Step by Step

    Let's Verify Step by Step , author=. ICLR , year=. 2305.20050 , archivePrefix=

  40. [40]

    and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , booktitle=

    Jimenez, Carlos E. and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , booktitle=

  41. [41]

    2024 , howpublished=

    Introducing. 2024 , howpublished=

  42. [42]

    Jain, Naman and Han, King and Gu, Alex and Li, Wen-Ding and Yan, Fanjia and Zhang, Tianjun and Wang, Sida and Solar-Lezama, Armando and Sen, Koushik and Stoica, Ion , journal=

  43. [43]

    Transactions on Machine Learning Research , year=

    Holistic Evaluation of Language Models , author=. Transactions on Machine Learning Research , year=

  44. [44]

    ICLR , year=

    Measuring Massive Multitask Language Understanding , author=. ICLR , year=

  45. [45]

    , booktitle=

    Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R. , booktitle=

  46. [46]

    2025 , howpublished=

    International Mathematical Olympiad , author=. 2025 , howpublished=

  47. [47]

    Glazer, Elliot and Erdil, Ege and Besiroglu, Tamay and Chicharro, Diego and Chen, Evan and Gunning, Alex and Olsson, Caroline Falkman and Denain, Jean-Stanislas and Ho, Anson and Santos, Emily de Oliveira and others , journal=

  48. [48]

    2025 , howpublished=

    Balunovi. 2025 , howpublished=

  49. [49]

    Evaluating Large Language Models Trained on Code

    Evaluating Large Language Models Trained on Code , author=. arXiv preprint arXiv:2107.03374 , year=

  50. [50]

    2009 , howpublished=

    Codeforces , author=. 2009 , howpublished=

  51. [51]

    Zhuo, Terry Yue and Vu, Minh Chien and Chim, Jenny and Hu, Han and Yu, Wenhao and Widyasari, Ratnadira and Yusuf, Imam Nur Bani and Zhan, Haolan and He, Junda and Paul, Indraneil and others , booktitle=

  52. [52]

    Xie, Tianbao and Zhang, Danyang and Chen, Jixuan and Li, Xiaochuan and Zhao, Siheng and Cao, Ruisheng and Hua, Toh Jing and Cheng, Zhoujun and Shin, Dongchan and Lei, Fangyu and others , booktitle=

  53. [53]

    and Zhu, Hao and Zhou, Xuhui and Lo, Robert and Sridhar, Abishek and Cheng, Xianyi and Ou, Tianyue and Bisk, Yonatan and Fried, Daniel and Alon, Uri and Neubig, Graham , booktitle=

    Zhou, Shuyan and Xu, Frank F. and Zhu, Hao and Zhou, Xuhui and Lo, Robert and Sridhar, Abishek and Cheng, Xianyi and Ou, Tianyue and Bisk, Yonatan and Fried, Daniel and Alon, Uri and Neubig, Graham , booktitle=

  54. [54]

    Yao, Shunyu and Shinn, Noah and Razavi, Pedram and Narasimhan, Karthik , journal=

  55. [55]

    On the Measure of Intelligence

    On the Measure of Intelligence , author=. arXiv preprint arXiv:1911.01547 , year=

  56. [56]

    ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

    Chollet, Fran. arXiv preprint arXiv:2505.11831 , year=

  57. [57]

    Humanity's Last Exam

    Humanity's Last Exam , author=. arXiv preprint arXiv:2501.14249 , year=

  58. [58]

    Wang, Yubo and Ma, Xueguang and Zhang, Ge and Ni, Yuansheng and Chandra, Abhranil and Guo, Shiguang and Ren, Weiming and Arulraj, Aaran and He, Xuan and Jiang, Ziyan and others , booktitle=

  59. [59]

    Measuring short-form factuality in large language models

    Measuring Short-Form Factuality in Large Language Models , author=. arXiv preprint arXiv:2411.04368 , year=

  60. [60]

    Wei, Jason and Sun, Zhiqing and Papay, Spencer and McKinney, Scott and Han, Jeffrey and Fulford, Isa and Chung, Hyung Won and Passos, Alex Tachard and Fedus, William and Glaese, Amelia , journal=

  61. [61]

    Yue, Xiang and Ni, Yuansheng and Zhang, Kai and Zheng, Tianyu and Liu, Ruoqi and Zhang, Ge and Stevens, Samuel and Jiang, Dongfu and Ren, Weiming and Sun, Yuxuan and others , booktitle=

  62. [62]

    Yue, Xiang and Zheng, Tianyu and Ni, Yuansheng and Wang, Yubo and Zhang, Kai and Tong, Shengbang and Sun, Yuxuan and Yu, Botao and Zhang, Ge and Sun, Huan and others , journal=

  63. [63]

    Hu, Kairui and Wu, Penghao and Pu, Fanyi and Xiao, Wang and Zhang, Yuanhan and Yue, Xiang and Li, Bo and Liu, Ziwei , journal=

  64. [64]

    Instruction-Following Evaluation for Large Language Models

    Instruction-Following Evaluation for Large Language Models , author=. arXiv preprint arXiv:2311.07911 , year=

  65. [65]

    Pyatkin, Valentina and others , journal=

  66. [66]

    and Stoica, Ion , journal=

    Li, Tianle and Chiang, Wei-Lin and Frick, Evan and Dunlap, Lisa and Wu, Tianhao and Zhu, Banghua and Gonzalez, Joseph E. and Stoica, Ion , journal=. From Crowdsourced Data to High-Quality Benchmarks:

  67. [67]

    GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

    Patwardhan, Tejal and others , year=. 2510.04374 , archivePrefix=

  68. [68]

    2025 , howpublished=

    Artificial Analysis Long Context Reasoning Benchmark , author=. 2025 , howpublished=

  69. [69]

    2025 , howpublished=

    Stanford Math Tournament 2025 , author=. 2025 , howpublished=

  70. [70]

    2025 , eprint=

    White, Colin and Dooley, Samuel and Roberts, Manley and Pal, Arka and Feuer, Ben and Jain, Siddhartha and Shwartz-Ziv, Ravid and Jain, Neel and Saifullah, Khalid and Naidu, Siddartha and others , booktitle=. 2025 , eprint=

  71. [71]

    Training Verifiers to Solve Math Word Problems

    Training Verifiers to Solve Math Word Problems , author=. arXiv preprint arXiv:2110.14168 , year=

  72. [72]
  73. [73]

    2025 , howpublished=

    Brown University Mathematics Olympiad 2025 , author=. 2025 , howpublished=

  74. [74]

    OpenAI GPT-5 System Card

    2025 , howpublished=. 2601.03267 , archivePrefix=

  75. [75]

    2026 , howpublished=

    Introducing. 2026 , howpublished=

  76. [76]

    Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and others , journal=. The

  77. [77]

    arXiv preprint arXiv:2412.19437 , year=

  78. [78]

    Nature , volume=

    Learning the Parts of Objects by Non-negative Matrix Factorization , author=. Nature , volume=

  79. [79]

    NeurIPS , year=

    Probabilistic Matrix Factorization , author=. NeurIPS , year=

  80. [80]

    Foundations and Trends in Machine Learning , volume=

    Conformal Prediction: A Gentle Introduction , author=. Foundations and Trends in Machine Learning , volume=. 2023 , doi=

Showing first 80 references.