pith. machine review for the scientific record.

arxiv: 2605.07046 · v1 · submitted 2026-05-07 · 📊 stat.ML · cs.AI · cs.LG

Recognition: 2 theorem links


An Interpretable and Scalable Framework for Evaluating Large Language Models

Hao Zeng, Qiang Heng, Xiaoqian Liu, Xinhao Qu

Pith reviewed 2026-05-11 01:05 UTC · model grok-4.3

classification 📊 stat.ML · cs.AI · cs.LG
keywords large language models · item response theory · matrix factorization · model evaluation · scalability · interpretability · majorization-minimization · benchmark design

The pith

Reformulating item response theory as constrained matrix factorization lets researchers evaluate large language models faster and more stably.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that applying the majorization-minimization principle turns the slow, unstable parameter estimation of item response theory into a sequence of constrained matrix factorization subproblems. This matters for a sympathetic reader because standard LLM benchmarks rely on simple accuracy averages that ignore output variability and item differences, while conventional IRT methods cannot scale to large leaderboards or many models. The reformulation supplies theoretical guarantees on identifiability and convergence, produces orders-of-magnitude speedups, and preserves or improves accuracy on synthetic data and real benchmarks such as MATH-500 and Open LLM Leaderboard tasks. The resulting estimates also reveal item difficulty and discrimination patterns that align with known scaling laws and can guide better benchmark construction.
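For readers outside psychometrics, the model being reformulated is the standard two-parameter logistic (2PL) IRT model. The display below is the textbook form, stated here as an assumption about the setup rather than quoted from the paper:

    \Pr(y_{ij} = 1 \mid \theta_i, a_j, b_j) \;=\; \sigma\bigl(a_j(\theta_i - b_j)\bigr),
    \qquad \sigma(x) = \frac{1}{1 + e^{-x}},

where y_ij records whether model i answers item j correctly, θ_i is the latent ability, a_j the item discrimination, and b_j the item difficulty. Collecting the logits into a matrix, M = θaᵀ − 1(a ⊙ b)ᵀ, exposes a rank-2 structure, which is what makes a constrained matrix-factorization treatment of the estimation problem natural.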

Core claim

Our approach reformulates the problem as a sequence of constrained matrix factorization subproblems, enabling stable and efficient parameter estimation with theoretical guarantees for identifiability and convergence. Experiments on synthetic and real-world datasets, including MATH-500 and six Open LLM Leaderboard benchmarks, demonstrate that our method achieves superior scalability and interpretability. It delivers orders-of-magnitude speedups over competing methods while maintaining comparable or even higher estimation accuracy.

What carries the argument

The majorization-minimization principle that converts item response theory estimation into a sequence of constrained matrix factorization subproblems.
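To make that mechanism concrete: the logistic negative log-likelihood has curvature at most 1/4 in the logits, so it admits a quadratic majorizer, and minimizing that surrogate under the low-rank IRT structure is a constrained least-squares (matrix factorization) problem. The usual MM sandwich inequality,

    f(\theta^{(t+1)}) \;\le\; g(\theta^{(t+1)} \mid \theta^{(t)}) \;\le\; g(\theta^{(t)} \mid \theta^{(t)}) \;=\; f(\theta^{(t)}),

where the surrogate g majorizes the objective f and touches it at the current iterate, then gives monotone descent. The Python sketch below is a minimal illustration under the 2PL assumption above, with generic centering and positivity constraints; it is not the paper's algorithm, whose constraint sets and update schedule are not reproduced here.

    # Minimal MM sketch (an assumption-laden illustration, not the paper's code):
    # fit P(y_ij = 1) = sigmoid(a_j * (theta_i - b_j)) by majorization-minimization.
    # Each outer step majorizes the logistic NLL by a quadratic (curvature <= 1/4),
    # so the update reduces to a least-squares factorization in (theta, a, b),
    # solved here by a few alternating closed-form updates.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def fit_irt_mm(Y, n_iters=200, inner=3, seed=0):
        """Y: (n_models, n_items) binary matrix of per-item correctness."""
        rng = np.random.default_rng(seed)
        n, p = Y.shape
        theta = rng.normal(size=n)            # latent abilities
        a = np.ones(p)                        # item discriminations (kept positive)
        b = np.zeros(p)                       # item difficulties
        for _ in range(n_iters):
            M = np.outer(theta, a) - a * b    # logits m_ij = a_j * (theta_i - b_j)
            # Minimizing the quadratic majorizer of the logistic NLL at M is
            # equivalent to least-squares fitting of the surrogate target Z.
            Z = M - 4.0 * (sigmoid(M) - Y)    # 4 = 1 / (1/4 curvature bound)
            for _ in range(inner):            # alternating LS on the factorization
                # theta_i <- argmin_t sum_j (z_ij + a_j*b_j - a_j*t)^2
                theta = ((Z + a * b) @ a) / (a @ a)
                theta -= theta.mean()         # identifiability: center abilities
                # per item j: regress z_:j on (theta, -1) to get (a_j, a_j*b_j)
                X = np.column_stack([theta, -np.ones(n)])
                coef, *_ = np.linalg.lstsq(X, Z, rcond=None)  # shape (2, p)
                a = np.maximum(coef[0], 1e-6)                 # keep a_j > 0
                b = coef[1] / a
        return theta, a, b

    # Worked example on synthetic responses.
    rng = np.random.default_rng(1)
    theta_true = rng.normal(size=50)
    a_true = rng.uniform(0.5, 2.0, size=200)
    b_true = rng.normal(size=200)
    P = sigmoid(np.outer(theta_true, a_true) - a_true * b_true)
    Y = (rng.random(P.shape) < P).astype(float)
    theta_hat, a_hat, b_hat = fit_irt_mm(Y)
    print(np.corrcoef(theta_true, theta_hat)[0, 1])  # recovery check; high if the sketch works

The factor 4 in the Z update is the reciprocal of the 1/4 curvature bound; tightening it entry-by-entry, and the exact form of the constraint sets, are among the places a real implementation would differ.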

If this is right

  • Computation time for full IRT-based evaluation drops by orders of magnitude.
  • Parameter estimates remain stable and at least as accurate as prior methods.
  • Item difficulty and discrimination values become available to inform benchmark design (one hypothetical use is sketched after this list).
  • The estimates are consistent with established LLM scaling laws.
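
One hypothetical way the benchmark-design point could be operationalized, reusing â and b̂ from the synthetic example above (an illustration, not a procedure the paper specifies): keep the most discriminating items within each difficulty band, so a trimmed benchmark still spans easy to hard items evenly.

    # Hypothetical item selection from fitted IRT parameters (not from the paper):
    # within each difficulty band, retain the items with the highest discrimination.
    import numpy as np

    def select_items(a_hat, b_hat, n_bands=5, per_band=10):
        bands = np.quantile(b_hat, np.linspace(0.0, 1.0, n_bands + 1))
        keep = []
        for lo, hi in zip(bands[:-1], bands[1:]):
            in_band = np.where((b_hat >= lo) & (b_hat <= hi))[0]
            top = in_band[np.argsort(a_hat[in_band])[::-1][:per_band]]
            keep.extend(top.tolist())
        return np.array(sorted(set(keep)))

    subset = select_items(a_hat, b_hat)   # a_hat, b_hat from the fit_irt_mm example
    print(len(subset), "items retained")

Selecting on discrimination within difficulty bands is one way to avoid a test set that over-represents easy or hard items, the failure mode flagged below.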

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The speed gain could make routine IRT analysis feasible for every new model release and every large benchmark.
  • Item-level insights might help construct balanced test sets that avoid over-representing easy or hard items.
  • If the underlying response model holds, the same factorization approach could extend to evaluating other stochastic systems such as reinforcement-learning agents.

Load-bearing premise

That the standard item response theory model, with its logistic relation between latent ability and item parameters, adequately describes LLM response patterns.

What would settle it

A new benchmark where the ability estimates produced by this method fail to predict performance on held-out items or deviate from observed scaling-law trends.
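
A minimal version of that test, reusing the fit_irt_mm sketch and the synthetic Y above (an illustrative protocol, not the paper's): estimate abilities from a random half of the items and ask whether they rank models on the held-out half at least as well as plain average accuracy does. A benchmark where the IRT abilities systematically lose this comparison, or stop tracking scaling-law trends, would be the failure described above.

    # Illustrative held-out ranking check; assumes fit_irt_mm and Y defined earlier.
    import numpy as np
    from scipy.stats import spearmanr

    def heldout_rank_check(Y, seed=0):
        rng = np.random.default_rng(seed)
        items = rng.permutation(Y.shape[1])
        A, B = items[: len(items) // 2], items[len(items) // 2 :]
        theta_A, _, _ = fit_irt_mm(Y[:, A])   # abilities estimated from split A only
        acc_A = Y[:, A].mean(axis=1)          # baseline predictor: mean accuracy on A
        acc_B = Y[:, B].mean(axis=1)          # target: accuracy on held-out split B
        rho_irt, _ = spearmanr(theta_A, acc_B)
        rho_acc, _ = spearmanr(acc_A, acc_B)
        return rho_irt, rho_acc               # IRT should match or beat the baseline

    rho_irt, rho_acc = heldout_rank_check(Y)
    print(f"held-out rank correlation: IRT {rho_irt:.3f} vs mean accuracy {rho_acc:.3f}")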

Figures

Figures reproduced from arXiv: 2605.07046 by Hao Zeng, Qiang Heng, Xiaoqian Liu, Xinhao Qu.

Figure 1: Scalability comparison across six large-scale benchmark suites. Runtime (sec) (left axis, …)
Figure 2: cBMM-recovered model abilities θ̂ for LLMs in the MATH-500 benchmark suite. The three subplots (left, middle, right) correspond to three model scale tiers: small (0–7B), medium (7–20B), and large (>20B). All panels span the same release date range, from July 2023 to September 2025. LLMs with the highest θ̂ are highlighted in each subplot, along with the LLM with the lowest θ̂ in the large-size subplot. …
Figure 3: Distribution of cBMM-recovered item difficulties (−b̂) across human-annotated difficulty levels (1–5) in the MATH-500 benchmark suite, shown for the overall and subject-specific results.
Figure 4: Spearman correlation between cBMM-recovered model abilities and average accuracies vs. the coefficient of variation of cBMM-recovered â. The left panel shows results for the top 20% of models selected by average accuracy, while the right panel includes all LLMs. Each scatter point corresponds to a benchmark suite, as annotated in the plot. Fine-grained ranking shifts under discrimination heterogeneity …
Figure 5: Ranking flow between average accuracy (left column in each panel) and …
original abstract

Evaluation of large language models (LLMs) is increasingly critical, yet standard benchmarking methods rely on average accuracy, overlooking both the inherent stochasticity of LLM outputs and the heterogeneity of benchmark items. Item Response Theory (IRT) offers a principled framework for modeling latent model abilities and item characteristics, but conventional methods are computationally expensive and numerically unstable, limiting large-scale implementations. To address these challenges, we propose an interpretable and scalable framework for LLM evaluation based on the majorization-minimization principle. Our approach reformulates the problem as a sequence of constrained matrix factorization subproblems, enabling stable and efficient parameter estimation with theoretical guarantees for identifiability and convergence. Experiments on synthetic and real-world datasets, including MATH-500 and six Open LLM Leaderboard benchmarks, demonstrate that our method achieves superior scalability and interpretability. It delivers orders-of-magnitude speedups over competing methods while maintaining comparable or even higher estimation accuracy. Our results align with established scaling laws and offer insights into item difficulty and discrimination, informing more principled benchmark design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce a scalable IRT-based framework for LLM evaluation by reformulating latent ability and item parameter estimation as a sequence of constrained matrix factorization subproblems solved via majorization-minimization. This yields theoretical guarantees for identifiability and convergence, orders-of-magnitude speedups over baselines, and comparable or superior accuracy on synthetic data plus real benchmarks (MATH-500 and six Open LLM Leaderboard tasks), while providing interpretable insights into item difficulty and discrimination that align with scaling laws.

Significance. If the central claims hold, the work would enable efficient large-scale IRT application to LLM evaluation, moving beyond average accuracy to model stochasticity and item heterogeneity. The matrix-factorization reformulation addresses a key computational barrier in traditional IRT, and the empirical results on established benchmarks plus alignment with scaling laws are strengths. However, these benefits are conditional on the IRT model being appropriate for LLM response patterns.

major comments (2)
  1. [Abstract and §3] (method): The abstract asserts 'theoretical guarantees for identifiability and convergence' from the majorization-minimization reformulation, but no derivation, proof sketch, or reference to supplementary material is provided. This is load-bearing for the scalability and stability claims.
  2. [§5] (experiments): No error bars, statistical significance tests, or ablation on IRT model fit (e.g., goodness-of-fit diagnostics, residual analysis, or comparison to non-IRT baselines) are reported for the real datasets (MATH-500, Open LLM Leaderboard). Without these, the claim of 'higher estimation accuracy' cannot be distinguished from optimization artifacts or model misspecification.
minor comments (2)
  1. [§3] The description of the constraint sets in the matrix factorization subproblems would benefit from explicit notation or a small illustrative example in the main text.
  2. [§4] Consider adding a brief discussion of how the method handles the stochasticity of LLM outputs beyond the standard logistic IRT link.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the presentation of our work. We address each major comment below and have revised the manuscript accordingly to improve clarity and rigor.

point-by-point responses
  1. Referee: [Abstract and §3] (method): The abstract asserts 'theoretical guarantees for identifiability and convergence' from the majorization-minimization reformulation, but no derivation, proof sketch, or reference to supplementary material is provided. This is load-bearing for the scalability and stability claims.

    Authors: We agree that an explicit derivation strengthens the claims. The majorization-minimization reformulation yields identifiability from the constrained matrix factorization (via uniqueness of the low-rank decomposition under the given constraints) and convergence from the standard MM properties (monotonic decrease of the objective and boundedness). In the revised version, we add a concise proof sketch to §3 and explicitly reference the supplementary material containing the full proofs. revision: yes

  2. Referee: [§5] (experiments): No error bars, statistical significance tests, or ablation on IRT model fit (e.g., goodness-of-fit diagnostics, residual analysis, or comparison to non-IRT baselines) are reported for the real datasets (MATH-500, Open LLM Leaderboard). Without these, the claim of 'higher estimation accuracy' cannot be distinguished from optimization artifacts or model misspecification.

    Authors: We acknowledge the need for greater statistical transparency on real data. In the revision, we will report error bars (standard deviations across repeated runs with different initializations), conduct statistical significance tests (paired t-tests against baselines), and add model-fit diagnostics including residual analysis and a direct comparison to non-IRT baselines such as mean accuracy. These additions will be included for both MATH-500 and the Open LLM Leaderboard tasks. revision: yes
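
A minimal form of the promised significance check (illustrative numbers, not results from the paper): a paired comparison of per-run ability-recovery errors across shared random initializations.

    # Paired t-test over repeated runs; the RMSE values below are placeholders.
    import numpy as np
    from scipy.stats import ttest_rel

    rmse_proposed = np.array([0.112, 0.108, 0.115, 0.110, 0.109])  # hypothetical per-seed errors
    rmse_baseline = np.array([0.131, 0.128, 0.135, 0.126, 0.130])

    stat, pval = ttest_rel(rmse_proposed, rmse_baseline)
    print(f"paired t = {stat:.2f}, p = {pval:.4f}")  # small p: improvement consistent across seeds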

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained optimization reformulation

full rationale

The paper reformulates standard IRT parameter estimation for LLM responses as a sequence of constrained matrix-factorization subproblems solved by majorization-minimization, claiming identifiability, convergence, and scalability gains. No quoted step defines target quantities in terms of fitted outputs, renames a known result, imports uniqueness via self-citation, or smuggles an ansatz; the central claims rest on the external IRT model plus the new solver, with experiments on MATH-500 and Open LLM Leaderboard providing independent validation. This is the common honest case of an applied optimization contribution whose correctness hinges on model assumptions rather than internal circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the standard IRT probabilistic model for binary or graded responses and on the majorization-minimization principle for non-convex optimization; no new entities are postulated.

free parameters (1)
  • latent ability and item parameters
    Estimated from data via the proposed solver; not fixed in advance.
axioms (1)
  • domain assumption: LLM responses are generated according to an IRT model with a logistic or similar link
    Invoked to justify modeling latent abilities and item characteristics from observed correctness data.

pith-pipeline@v0.9.0 · 5481 in / 1182 out tokens · 42704 ms · 2026-05-11T01:05:24.102350+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
