pith. machine review for the scientific record.

arxiv: 2605.07046 · v1 · submitted 2026-05-07 · 📊 stat.ML · cs.AI · cs.LG

Recognition: 2 theorem links


An Interpretable and Scalable Framework for Evaluating Large Language Models

Hao Zeng, Qiang Heng, Xiaoqian Liu, Xinhao Qu

Pith reviewed 2026-05-11 01:05 UTC · model grok-4.3

classification 📊 stat.ML · cs.AI · cs.LG
keywords large language models · item response theory · matrix factorization · model evaluation · scalability · interpretability · majorization-minimization · benchmark design

The pith

Reformulating item response theory as constrained matrix factorization lets researchers evaluate large language models faster and more stably.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that applying the majorization-minimization principle turns the slow, unstable parameter estimation of item response theory into a sequence of constrained matrix factorization subproblems. This matters for a sympathetic reader because standard LLM benchmarks rely on simple accuracy averages that ignore output variability and item differences, while conventional IRT methods cannot scale to large leaderboards or many models. The reformulation supplies theoretical guarantees on identifiability and convergence, produces orders-of-magnitude speedups, and preserves or improves accuracy on synthetic data and real benchmarks such as MATH-500 and Open LLM Leaderboard tasks. The resulting estimates also reveal item difficulty and discrimination patterns that align with known scaling laws and can guide better benchmark construction.
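For readers outside psychometrics, the model being reformulated is the standard two-parameter logistic (2PL) IRT model. The display below is the textbook form, stated here as an assumption about the setup rather than quoted from the paper:

    \Pr(y_{ij} = 1 \mid \theta_i, a_j, b_j) \;=\; \sigma\bigl(a_j(\theta_i - b_j)\bigr),
    \qquad \sigma(x) = \frac{1}{1 + e^{-x}},

where y_ij records whether model i answers item j correctly, θ_i is the latent ability, a_j the item discrimination, and b_j the item difficulty. Collecting the logits into a matrix, M = θaᵀ − 1(a ⊙ b)ᵀ, exposes a rank-2 structure, which is what makes a constrained matrix-factorization treatment of the estimation problem natural.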

Core claim

Our approach reformulates the problem as a sequence of constrained matrix factorization subproblems, enabling stable and efficient parameter estimation with theoretical guarantees for identifiability and convergence. Experiments on synthetic and real-world datasets, including MATH-500 and six Open LLM Leaderboard benchmarks, demonstrate that our method achieves superior scalability and interpretability. It delivers orders-of-magnitude speedups over competing methods while maintaining comparable or even higher estimation accuracy.

What carries the argument

The majorization-minimization principle that converts item response theory estimation into a sequence of constrained matrix factorization subproblems.
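To make that mechanism concrete: the logistic negative log-likelihood has curvature at most 1/4 in the logits, so it admits a quadratic majorizer, and minimizing that surrogate under the low-rank IRT structure is a constrained least-squares (matrix factorization) problem. The usual MM sandwich inequality,

    f(\theta^{(t+1)}) \;\le\; g(\theta^{(t+1)} \mid \theta^{(t)}) \;\le\; g(\theta^{(t)} \mid \theta^{(t)}) \;=\; f(\theta^{(t)}),

where the surrogate g majorizes the objective f and touches it at the current iterate, then gives monotone descent. The Python sketch below is a minimal illustration under the 2PL assumption above, with generic centering and positivity constraints; it is not the paper's algorithm, whose constraint sets and update schedule are not reproduced here.

    # Minimal MM sketch (an assumption-laden illustration, not the paper's code):
    # fit P(y_ij = 1) = sigmoid(a_j * (theta_i - b_j)) by majorization-minimization.
    # Each outer step majorizes the logistic NLL by a quadratic (curvature <= 1/4),
    # so the update reduces to a least-squares factorization in (theta, a, b),
    # solved here by a few alternating closed-form updates.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def fit_irt_mm(Y, n_iters=200, inner=3, seed=0):
        """Y: (n_models, n_items) binary matrix of per-item correctness."""
        rng = np.random.default_rng(seed)
        n, p = Y.shape
        theta = rng.normal(size=n)            # latent abilities
        a = np.ones(p)                        # item discriminations (kept positive)
        b = np.zeros(p)                       # item difficulties
        for _ in range(n_iters):
            M = np.outer(theta, a) - a * b    # logits m_ij = a_j * (theta_i - b_j)
            # Minimizing the quadratic majorizer of the logistic NLL at M is
            # equivalent to least-squares fitting of the surrogate target Z.
            Z = M - 4.0 * (sigmoid(M) - Y)    # 4 = 1 / (1/4 curvature bound)
            for _ in range(inner):            # alternating LS on the factorization
                # theta_i <- argmin_t sum_j (z_ij + a_j*b_j - a_j*t)^2
                theta = ((Z + a * b) @ a) / (a @ a)
                theta -= theta.mean()         # identifiability: center abilities
                # per item j: regress z_:j on (theta, -1) to get (a_j, a_j*b_j)
                X = np.column_stack([theta, -np.ones(n)])
                coef, *_ = np.linalg.lstsq(X, Z, rcond=None)  # shape (2, p)
                a = np.maximum(coef[0], 1e-6)                 # keep a_j > 0
                b = coef[1] / a
        return theta, a, b

    # Worked example on synthetic responses.
    rng = np.random.default_rng(1)
    theta_true = rng.normal(size=50)
    a_true = rng.uniform(0.5, 2.0, size=200)
    b_true = rng.normal(size=200)
    P = sigmoid(np.outer(theta_true, a_true) - a_true * b_true)
    Y = (rng.random(P.shape) < P).astype(float)
    theta_hat, a_hat, b_hat = fit_irt_mm(Y)
    print(np.corrcoef(theta_true, theta_hat)[0, 1])  # recovery check; high if the sketch works

The factor 4 in the Z update is the reciprocal of the 1/4 curvature bound; tightening it entry-by-entry, and the exact form of the constraint sets, are among the places a real implementation would differ.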

If this is right

  • Computation time for full IRT-based evaluation drops by orders of magnitude.
  • Parameter estimates remain stable and at least as accurate as prior methods.
  • Item difficulty and discrimination values become available to inform benchmark design (one hypothetical use is sketched after this list).
  • The estimates are consistent with established LLM scaling laws.
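
One hypothetical way the benchmark-design point could be operationalized, reusing â and b̂ from the synthetic example above (an illustration, not a procedure the paper specifies): keep the most discriminating items within each difficulty band, so a trimmed benchmark still spans easy to hard items evenly.

    # Hypothetical item selection from fitted IRT parameters (not from the paper):
    # within each difficulty band, retain the items with the highest discrimination.
    import numpy as np

    def select_items(a_hat, b_hat, n_bands=5, per_band=10):
        bands = np.quantile(b_hat, np.linspace(0.0, 1.0, n_bands + 1))
        keep = []
        for lo, hi in zip(bands[:-1], bands[1:]):
            in_band = np.where((b_hat >= lo) & (b_hat <= hi))[0]
            top = in_band[np.argsort(a_hat[in_band])[::-1][:per_band]]
            keep.extend(top.tolist())
        return np.array(sorted(set(keep)))

    subset = select_items(a_hat, b_hat)   # a_hat, b_hat from the fit_irt_mm example
    print(len(subset), "items retained")

Selecting on discrimination within difficulty bands is one way to avoid a test set that over-represents easy or hard items, the failure mode flagged below.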

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The speed gain could make routine IRT analysis feasible for every new model release and every large benchmark.
  • Item-level insights might help construct balanced test sets that avoid over-representing easy or hard items.
  • If the underlying response model holds, the same factorization approach could extend to evaluating other stochastic systems such as reinforcement-learning agents.

Load-bearing premise

That the standard item response theory model, with its logistic relation between latent ability and item parameters, adequately describes LLM response patterns.

What would settle it

A new benchmark where the ability estimates produced by this method fail to predict performance on held-out items or deviate from observed scaling-law trends.
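
A minimal version of that test, reusing the fit_irt_mm sketch and the synthetic Y above (an illustrative protocol, not the paper's): estimate abilities from a random half of the items and ask whether they rank models on the held-out half at least as well as plain average accuracy does. A benchmark where the IRT abilities systematically lose this comparison, or stop tracking scaling-law trends, would be the failure described above.

    # Illustrative held-out ranking check; assumes fit_irt_mm and Y defined earlier.
    import numpy as np
    from scipy.stats import spearmanr

    def heldout_rank_check(Y, seed=0):
        rng = np.random.default_rng(seed)
        items = rng.permutation(Y.shape[1])
        A, B = items[: len(items) // 2], items[len(items) // 2 :]
        theta_A, _, _ = fit_irt_mm(Y[:, A])   # abilities estimated from split A only
        acc_A = Y[:, A].mean(axis=1)          # baseline predictor: mean accuracy on A
        acc_B = Y[:, B].mean(axis=1)          # target: accuracy on held-out split B
        rho_irt, _ = spearmanr(theta_A, acc_B)
        rho_acc, _ = spearmanr(acc_A, acc_B)
        return rho_irt, rho_acc               # IRT should match or beat the baseline

    rho_irt, rho_acc = heldout_rank_check(Y)
    print(f"held-out rank correlation: IRT {rho_irt:.3f} vs mean accuracy {rho_acc:.3f}")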

Figures

Figures reproduced from arXiv: 2605.07046 by Hao Zeng, Qiang Heng, Xiaoqian Liu, Xinhao Qu.

Figure 1: Scalability comparison across six large-scale benchmark suites. Runtime (sec) (left axis, …)
Figure 2: cBMM-recovered model abilities θ̂ for LLMs in the MATH-500 benchmark suite. The three subplots (left, middle, right) correspond to three model scale tiers: small (0–7B), medium (7–20B), and large (>20B). All panels span the same release date range, from July 2023 to September 2025. LLMs with the highest θ̂ are highlighted in each subplot, along with the LLM with the lowest θ̂ in the large-size subplot. …
Figure 3: Distribution of cBMM-recovered item difficulties (−b̂) across human-annotated difficulty levels (1–5) in the MATH-500 benchmark suite, shown for the overall and subject-specific results.
Figure 4: Spearman correlation between cBMM-recovered model abilities and average accuracies vs. the coefficient of variation of cBMM-recovered â. The left panel shows results for the top 20% of models selected by average accuracy, while the right panel includes all LLMs. Each scatter point corresponds to a benchmark suite, as annotated in the plot. Fine-grained ranking shifts under discrimination heterogeneity …
Figure 5: Ranking flow between average accuracy (left column in each panel) and …
original abstract

Evaluation of large language models (LLMs) is increasingly critical, yet standard benchmarking methods rely on average accuracy, overlooking both the inherent stochasticity of LLM outputs and the heterogeneity of benchmark items. Item Response Theory (IRT) offers a principled framework for modeling latent model abilities and item characteristics, but conventional methods are computationally expensive and numerically unstable, limiting large-scale implementations. To address these challenges, we propose an interpretable and scalable framework for LLM evaluation based on the majorization-minimization principle. Our approach reformulates the problem as a sequence of constrained matrix factorization subproblems, enabling stable and efficient parameter estimation with theoretical guarantees for identifiability and convergence. Experiments on synthetic and real-world datasets, including MATH-500 and six Open LLM Leaderboard benchmarks, demonstrate that our method achieves superior scalability and interpretability. It delivers orders-of-magnitude speedups over competing methods while maintaining comparable or even higher estimation accuracy. Our results align with established scaling laws and offer insights into item difficulty and discrimination, informing more principled benchmark design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce a scalable IRT-based framework for LLM evaluation by reformulating latent ability and item parameter estimation as a sequence of constrained matrix factorization subproblems solved via majorization-minimization. This yields theoretical guarantees for identifiability and convergence, orders-of-magnitude speedups over baselines, and comparable or superior accuracy on synthetic data plus real benchmarks (MATH-500 and six Open LLM Leaderboard tasks), while providing interpretable insights into item difficulty and discrimination that align with scaling laws.

Significance. If the central claims hold, the work would enable efficient large-scale IRT application to LLM evaluation, moving beyond average accuracy to model stochasticity and item heterogeneity. The matrix-factorization reformulation addresses a key computational barrier in traditional IRT, and the empirical results on established benchmarks plus alignment with scaling laws are strengths. However, these benefits are conditional on the IRT model being appropriate for LLM response patterns.

major comments (2)
  1. [Abstract and §3] (method): The abstract asserts 'theoretical guarantees for identifiability and convergence' from the majorization-minimization reformulation, but no derivation, proof sketch, or reference to supplementary material is provided. This is load-bearing for the scalability and stability claims.
  2. [§5] (experiments): No error bars, statistical significance tests, or ablation on IRT model fit (e.g., goodness-of-fit diagnostics, residual analysis, or comparison to non-IRT baselines) are reported for the real datasets (MATH-500, Open LLM Leaderboard). Without these, the claim of 'higher estimation accuracy' cannot be distinguished from optimization artifacts or model misspecification.
minor comments (2)
  1. [§3] The description of the constraint sets in the matrix factorization subproblems would benefit from explicit notation or a small illustrative example in the main text.
  2. [§4] Consider adding a brief discussion of how the method handles the stochasticity of LLM outputs beyond the standard logistic IRT link.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the presentation of our work. We address each major comment below and have revised the manuscript accordingly to improve clarity and rigor.

point-by-point responses
  1. Referee: [Abstract and §3] (method): The abstract asserts 'theoretical guarantees for identifiability and convergence' from the majorization-minimization reformulation, but no derivation, proof sketch, or reference to supplementary material is provided. This is load-bearing for the scalability and stability claims.

    Authors: We agree that an explicit derivation strengthens the claims. The majorization-minimization reformulation yields identifiability from the constrained matrix factorization (via uniqueness of the low-rank decomposition under the given constraints) and convergence from the standard MM properties (monotonic decrease of the objective and boundedness). In the revised version, we add a concise proof sketch to §3 and explicitly reference the supplementary material containing the full proofs. revision: yes

  2. Referee: [§5] (experiments): No error bars, statistical significance tests, or ablation on IRT model fit (e.g., goodness-of-fit diagnostics, residual analysis, or comparison to non-IRT baselines) are reported for the real datasets (MATH-500, Open LLM Leaderboard). Without these, the claim of 'higher estimation accuracy' cannot be distinguished from optimization artifacts or model misspecification.

    Authors: We acknowledge the need for greater statistical transparency on real data. In the revision, we will report error bars (standard deviations across repeated runs with different initializations), conduct statistical significance tests (paired t-tests against baselines), and add model-fit diagnostics including residual analysis and a direct comparison to non-IRT baselines such as mean accuracy. These additions will be included for both MATH-500 and the Open LLM Leaderboard tasks. revision: yes
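
A minimal form of the promised significance check (illustrative numbers, not results from the paper): a paired comparison of per-run ability-recovery errors across shared random initializations.

    # Paired t-test over repeated runs; the RMSE values below are placeholders.
    import numpy as np
    from scipy.stats import ttest_rel

    rmse_proposed = np.array([0.112, 0.108, 0.115, 0.110, 0.109])  # hypothetical per-seed errors
    rmse_baseline = np.array([0.131, 0.128, 0.135, 0.126, 0.130])

    stat, pval = ttest_rel(rmse_proposed, rmse_baseline)
    print(f"paired t = {stat:.2f}, p = {pval:.4f}")  # small p: improvement consistent across seeds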

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained optimization reformulation

full rationale

The paper reformulates standard IRT parameter estimation for LLM responses as a sequence of constrained matrix-factorization subproblems solved by majorization-minimization, claiming identifiability, convergence, and scalability gains. No quoted step defines target quantities in terms of fitted outputs, renames a known result, imports uniqueness via self-citation, or smuggles an ansatz; the central claims rest on the external IRT model plus the new solver, with experiments on MATH-500 and Open LLM Leaderboard providing independent validation. This is the common honest case of an applied optimization contribution whose correctness hinges on model assumptions rather than internal circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the standard IRT probabilistic model for binary or graded responses and on the majorization-minimization principle for non-convex optimization; no new entities are postulated.

free parameters (1)
  • latent ability and item parameters
    Estimated from data via the proposed solver; not fixed in advance.
axioms (1)
  • domain assumption: LLM responses are generated according to an IRT model with a logistic or similar link
    Invoked to justify modeling latent abilities and item characteristics from observed correctness data.

pith-pipeline@v0.9.0 · 5481 in / 1182 out tokens · 42704 ms · 2026-05-11T01:05:24.102350+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
