Recognition: 2 Lean theorem links
When LLMs get significantly worse: A statistical approach to detect model degradations
Pith reviewed 2026-05-16 06:00 UTC · model grok-4.3
The pith
A hypothesis test on paired per-sample outputs detects real LLM degradations after optimization while controlling false positives.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that comparing the two models on each individual sample with McNemar's test yields a statistically valid procedure for declaring degradation. The test directly counts the discordant pairs, those where the optimized model errs and the original succeeds (or vice versa), and asks whether that imbalance exceeds what random noise would produce; the aggregation rules across tasks then preserve the overall false-positive guarantee.
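As a reading aid (our notation, following the standard McNemar setup rather than anything confirmed from the paper's text), the discordant-pair logic is:

```latex
% Paired per-sample correctness forms a 2x2 contingency table:
%                      optimized correct   optimized wrong
%   original correct          a                  b
%   original wrong            c                  d
% Only the discordant counts b and c carry information about degradation.
\[
  \chi^2 = \frac{(b-c)^2}{b+c} \;\overset{H_0}{\sim}\; \chi^2_1,
  \qquad\text{or exactly}\qquad
  b \mid (b+c) \;\overset{H_0}{\sim}\; \mathrm{Binomial}\!\left(b+c,\tfrac12\right).
\]
```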
What carries the argument
McNemar's test on paired per-sample correctness indicators, which counts cases of disagreement between the two models and tests whether one model is systematically better.
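A minimal sketch of that machinery, assuming per-sample 0/1 correctness indicators; the function name and signature are ours, not the paper's LM Evaluation Harness implementation:

```python
from scipy.stats import binom


def mcnemar_one_sided(orig_correct, opt_correct, alpha=0.05):
    """Exact one-sided McNemar test on paired 0/1 correctness indicators.

    b counts pairs the optimized model loses (original right, optimized wrong),
    c counts pairs it wins. Under H0 (no degradation), b | (b + c) is
    Binomial(b + c, 1/2); we reject when b is improbably large.
    """
    b = sum(1 for o, p in zip(orig_correct, opt_correct) if o == 1 and p == 0)
    c = sum(1 for o, p in zip(orig_correct, opt_correct) if o == 0 and p == 1)
    n = b + c
    if n == 0:
        return 1.0, False  # no discordant pairs: no evidence of degradation
    p_value = binom.sf(b - 1, n, 0.5)  # P(X >= b) for X ~ Binomial(n, 1/2)
    return p_value, p_value <= alpha
```

Note that only the discordant pairs enter the statistic; samples where both models agree affect power but not the test itself.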
If this is right
- Accuracy drops of 0.3 percent can be declared statistically significant degradations rather than noise (see the worked sample-size sketch after this list).
- The procedure controls the overall false-positive rate when decisions are made across multiple benchmarks.
- Three aggregation methods allow a single accept/reject decision while preserving the per-test guarantees.
- The test distinguishes harmless numerical variation from genuine quality loss in temperature-zero evaluations.
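On that 0.3 percent figure: taking the SNR definition quoted later on this page at face value (SNR = √(N/p≠)·δ, with p≠ the probability that the two models disagree on a sample), a rough sample-size requirement follows. The p≠ = 0.05 value below is an illustrative assumption, not a number from the paper:

```latex
\[
  \mathrm{SNR} = \sqrt{\frac{N}{p_{\neq}}}\,\delta
  \quad\Longrightarrow\quad
  N \;\gtrsim\; p_{\neq}\left(\frac{z_{1-\alpha} + z_{\mathrm{power}}}{\delta}\right)^{2}.
\]
% For delta = 0.003, p_neq = 0.05, one-sided alpha = 0.05 and 80% power
% (z terms: 1.645 + 0.842 ~= 2.49):
\[
  N \;\gtrsim\; 0.05\left(\frac{2.49}{0.003}\right)^{2} \;\approx\; 3.4\times 10^{4}
  \text{ pooled samples.}
\]
```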
Where Pith is reading between the lines
- The same paired-sample logic could be applied to detect degradation after fine-tuning or continued pre-training rather than only inference optimizations.
- Adoption would likely make practitioners more cautious about aggressive quantization or pruning once small drops become detectable.
- The framework assumes evaluation data remain stationary; any drift in the test set itself would require additional corrections not addressed here.
Load-bearing premise
That observed differences between the paired outputs of the original and optimized models arise only from the optimization itself and satisfy the exchangeability conditions required by McNemar's test.
What would settle it
Run the test repeatedly on a model pair known to be identical except for a provably lossless transformation that introduces no accuracy change; if the procedure rejects the null more often than the target significance level allows, the framework fails to control false positives.
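That settling experiment can also be dry-run in simulation. A minimal sketch, assuming symmetric numerical flips as the lossless-noise model and reusing the hypothetical mcnemar_one_sided sketch above; a calibrated procedure should reject in roughly an alpha fraction of trials or fewer:

```python
import numpy as np

rng = np.random.default_rng(0)

def null_rejection_rate(n_samples=5000, flip_prob=0.01, trials=2000, alpha=0.05):
    """Empirical false-positive rate when the 'optimized' model is the original
    plus direction-free numerical flips, i.e. the null hypothesis is true."""
    rejections = 0
    for _ in range(trials):
        orig = rng.integers(0, 2, size=n_samples)   # baseline per-sample correctness
        flips = rng.random(n_samples) < flip_prob   # benign, symmetric disagreements
        opt = np.where(flips, 1 - orig, orig)
        _, reject = mcnemar_one_sided(orig, opt, alpha)
        rejections += reject
    return rejections / trials
```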
Original abstract
Minimizing the inference cost and latency of foundation models has become a crucial area of research. Optimization approaches include theoretically lossless methods and others, like quantization, that come without accuracy guarantees. In all of these cases it is crucial to ensure that the model quality has not degraded. However, even at temperature zero, model generations are not necessarily robust even to theoretically lossless model optimizations, due to numerical errors. We thus require statistical tools to decide whether a finite-sample accuracy deviation is evidence of a model's degradation or whether it can be attributed to (harmless) noise in the evaluation. We propose a statistically sound hypothesis testing framework based on McNemar's test that allows efficient detection of model degradations while guaranteeing a controlled rate of false positives. The crucial insight is that we have to compare the model scores on each sample, rather than aggregated at the task level. Furthermore, we propose three approaches to aggregate accuracy estimates across multiple benchmarks into a single decision. We provide an implementation on top of the widely adopted open-source LM Evaluation Harness and a case study illustrating that the method correctly flags degraded models while not flagging model optimizations that are provably lossless. We find that with our tests even empirical accuracy degradations of 0.3% can be confidently attributed to actual degradations rather than noise.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a statistical hypothesis testing framework using McNemar's test on paired per-sample correct/incorrect labels to determine whether observed accuracy differences between an original LLM and an optimized version indicate true degradation or can be attributed to evaluation noise. The approach aims to control the false positive rate, proposes aggregation methods across benchmarks, provides an open-source implementation in the LM Evaluation Harness, and includes a case study demonstrating its ability to detect small degradations (0.3%) while sparing provably lossless optimizations.
Significance. This work addresses an important practical problem in LLM optimization research by offering a rigorous statistical method to validate model quality post-optimization. If the assumptions of McNemar's test are satisfied in the benchmark setting, the framework could become a standard tool for distinguishing signal from noise in accuracy evaluations. The provision of code in the LM Evaluation Harness and the empirical case study are notable strengths that enhance reproducibility and applicability.
major comments (2)
- [§3] McNemar test application: The claim of a controlled false-positive rate relies on the asymptotic chi-squared distribution of the McNemar statistic, (b-c)^2/(b+c), which requires independent pairs. LLM benchmarks routinely contain correlated samples (multiple questions from the same passage or similar prompts), violating independence and potentially inflating type I error above the nominal alpha. This directly undermines the central guarantee and is not addressed by the case study on lossless models.
- [§4] Aggregation across benchmarks: The three proposed aggregation approaches are not shown to preserve type I error control when combining p-values or decisions from dependent benchmarks; without explicit multiple-testing correction or simulation under realistic dependence, the single-decision claim lacks support.
minor comments (2)
- [Abstract] The three aggregation approaches are mentioned but not named or briefly described, reducing immediate clarity.
- [Implementation] No usage example or pseudocode is provided despite the open-source claim; a short snippet would aid adoption.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below with specific plans for revision to strengthen the statistical foundations and empirical validation of the proposed framework.
Point-by-point responses
- Referee: [§3] McNemar test application: The claim of a controlled false-positive rate relies on the asymptotic chi-squared distribution of the McNemar statistic, (b-c)^2/(b+c), which requires independent pairs. LLM benchmarks routinely contain correlated samples (multiple questions from the same passage or similar prompts), violating independence and potentially inflating type I error above the nominal alpha. This directly undermines the central guarantee and is not addressed by the case study on lossless models.
Authors: We appreciate the referee's identification of this key assumption. The standard McNemar test and its asymptotic chi-squared approximation do require independent paired observations, and LLM benchmarks commonly feature dependence arising from shared passages, similar prompts, or related questions. Our case study illustrates behavior on provably lossless optimizations but does not quantify type I error inflation under realistic dependence. In the revised manuscript we will expand §3 with an explicit discussion of this limitation. We will also add Monte Carlo simulation experiments that inject controlled dependence structures (e.g., by correlating per-sample error indicators across related items) to measure the realized false-positive rate. If the simulations reveal material inflation, we will report conservative alpha adjustments or alternative procedures; otherwise we will document the observed robustness.
Revision planned: yes
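One shape the promised Monte Carlo could take (a sketch under our own assumptions, not the authors' protocol, again reusing the hypothetical mcnemar_one_sided sketch above): share correctness and flip decisions within clusters so that discordant pairs arrive in same-direction bursts, then measure how often the nominally level-alpha test rejects a true null.

```python
import numpy as np

rng = np.random.default_rng(1)

def clustered_null_rejection_rate(n_clusters=500, cluster_size=10,
                                  flip_prob=0.02, trials=2000, alpha=0.05):
    """Type I error of the paired test when correctness and flips are shared
    within clusters (e.g. questions drawn from one passage). Each flipped
    cluster pushes all its discordant pairs in the same direction, inflating
    the variance of b - c relative to the independent-pairs model."""
    rejections = 0
    for _ in range(trials):
        orig = np.repeat(rng.integers(0, 2, size=n_clusters), cluster_size)
        flips = np.repeat(rng.random(n_clusters) < flip_prob, cluster_size)
        opt = np.where(flips, 1 - orig, orig)
        _, reject = mcnemar_one_sided(orig, opt, alpha)
        rejections += reject
    return rejections / trials  # materially above alpha indicates inflation
```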
- Referee: [§4] Aggregation across benchmarks: The three proposed aggregation approaches are not shown to preserve type I error control when combining p-values or decisions from dependent benchmarks; without explicit multiple-testing correction or simulation under realistic dependence, the single-decision claim lacks support.
Authors: We concur that the aggregation methods require explicit validation under inter-benchmark dependence to support the single-decision guarantee. The current manuscript presents the three aggregation strategies without such checks. We will revise §4 to include simulation studies that generate dependent p-values or binary decisions across benchmarks (modeling realistic correlation induced by shared model behavior). These experiments will verify whether the combined type I error stays at or below the nominal level. Where appropriate, we will incorporate standard multiple-testing corrections (e.g., Bonferroni or FDR control) into the aggregation rules and report the resulting operating characteristics.
Revision planned: yes
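For concreteness, the two textbook combiners named in this response look as follows (a sketch; the paper's own three aggregation rules are not reproduced here). Fisher's method is exact only for independent per-benchmark p-values, while Bonferroni remains valid under arbitrary dependence:

```python
import numpy as np
from scipy.stats import chi2

def fisher_combine(p_values):
    """Fisher's method: -2 * sum(ln p_i) is chi-squared with 2T dof under H0,
    assuming the T per-benchmark p-values are independent."""
    p = np.asarray(p_values, dtype=float)
    return chi2.sf(-2.0 * np.log(p).sum(), df=2 * len(p))

def bonferroni_combine(p_values):
    """Bonferroni: min(1, T * min p_i); conservative but dependence-proof."""
    p = np.asarray(p_values, dtype=float)
    return min(1.0, len(p) * float(p.min()))

# e.g. bonferroni_combine([0.002, 0.4, 0.7]) -> 0.006: still significant at 0.05.
```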
Circularity Check
No significant circularity; a standard application of McNemar's test to paired samples.
Full rationale
The paper proposes applying the established McNemar's test directly to paired per-sample correct/incorrect outcomes between two models, without deriving new statistics, fitting parameters that are then called predictions, or relying on self-citations for the core validity claim. The hypothesis-testing guarantee follows from the standard asymptotic chi-squared property of the McNemar statistic under its usual assumptions, which are external to the paper. No equations reduce the proposed decision rule to the input data by construction, and the aggregation methods across benchmarks are presented as straightforward combinations rather than self-referential derivations. The work is therefore self-contained as an application of existing statistical tools.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: McNemar's test assumptions hold for paired LLM outputs on identical prompts.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "We propose a statistically sound hypothesis testing framework based on McNemar's test... exact one-sided McNemar test... q̂↓ = b/(b+c) ... SNR := √(N/p≠)·δ"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery theorem · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "Fact 1 (Model Degradation). The accuracy β ... is worse ... iff the degradation probability q↓ > 1/2"
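Reading the quoted notation as P_b = per-sample probability that the optimized model loses a discordant pair, P_c that it wins, p≠ = P_b + P_c, and q↓ = P_b / p≠, the iff in Fact 1 is a one-line identity (our reconstruction, hedged on that reading):

```latex
\[
  \tilde{\beta} - \beta \;=\; P_c - P_b \;=\; p_{\neq}\,(1 - 2\,q_{\downarrow}),
\]
% so the optimized accuracy is strictly worse than the baseline
% exactly when q_down > 1/2.
```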
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] gpt-oss-120b & gpt-oss-20b Model Card. Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. arXiv:2508.10925.
- [2] Accelerating Large Language Model Decoding with Speculative Sampling. Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. arXiv:2302.01318.
- [3] The Llama 3 Herd of Models. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. arXiv:2407.21783.
- [4] The Language Model Evaluation Harness. URL: https://zenodo.org/records/12608602. Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, and Bharath K. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. Advances in Neural Information Processing Systems, 25.
- [5] A Proximal Operator for Inducing 2:4-Sparsity. Jonas M. Kübler, Yu-Xiang Wang, Shoham Sabach, Navid Ansari, Matthäus Kleindessner, Kailash Budhathoki, Volkan Cevher, and George Karypis. Transactions on Machine Learning Research.
- [6] Solving Quantitative Reasoning Problems with Language Models. Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dye, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. arXiv:2206.14858.
- [7] DeepSeek-V3 Technical Report. Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. arXiv:2412.19437.
- [8] Predictive Discriminant Analysis Versus Logistic Regression in Two-Group Classification Problems. Alice Meshbane and John D. Morris. Annual Meeting of the American Educational Research Association (New York, NY, April 8-12, 1996).
- [9] Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations. Evan Miller. arXiv:2411.00640.
- [10] Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond. Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. arXiv:1602.06023.
- [11] GPQA: A Graduate-Level Google-Proof Q&A Benchmark. David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. arXiv:2311.12022.
- [12] MuSR: Testing the Limits of Chain-of-Thought with Multistep Soft Reasoning. Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. arXiv:2310.16049.
- [13] Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. arXiv:2210.09261.
- [14] MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. arXiv:2406.01574.
- [15] Transformers: State-of-the-Art Natural Language Processing. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020.
- [16] Qwen3 Technical Report. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. arXiv:2505.09388, 2025. Jing Yang, Ruibo Wang, Yijun Song, and Jihong Li. Block-Regularized 5×2 Cross-Validated McNemar's Test for Comparing Two Classification Algorithms.
- [17] Instruction-Following Evaluation for Large Language Models. Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. arXiv:2311.07911.
- [18] A Survey on Efficient Inference for Large Language Models. Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. arXiv:2404.14294.
- [19] Appendix excerpt (Notation and Algorithms): Table 3 summarizes the paper's notation, including the baseline and optimized models M and M̃, the total sample size N, the number of tasks T, and the contingency-table counts a, b, c, d; Algorithms 1-4 describe the aggregation procedures.
- [20] Appendix excerpt: denoting the population probabilities P_a, P_b, P_c, P_d, the paper states E[D(X)] = P_b − P_c (14) and E[D²(X)] = P_b + P_c (15). Algorithm 3 (Fisher Aggregation Test) takes lists b_list, c_list for T tasks and a significance level α, computes p_i = McNemar(b = b_list[i], c = c_list[i], α) for each task, forms χ²_stat = −2 Σ_{i=1}^{T} ln(p_i), and returns the combined p-value.
- [21] Appendix excerpt (Table 4): full model checkpoint specifiers used in the experiments, e.g. Llama-3.1 8B = meta-llama/Llama-3.1-8B-Instruct and its w4a16 quantization = RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16; the HuggingFace model cards give full configuration details.
- [22] Appendix excerpt (Algorithm 5, Permutation Pooled Test): takes score lists L_M = [L̂_M(x_1), ...], standardizes per-task differences by their standard errors, and uses the maximum standardized difference as the test statistic. Footnoted URLs: https://artificialanalysis.ai/methodology/intelligence-benchmarking#intelligence-index-evaluation-suite-summary and https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals
- [23] Appendix excerpt: since the GPT-OSS models already ship with their MoE modules in MX-FP4 precision, no meaningfully further-quantized variants were available; the 20B model is therefore compared against a rerun and against a version with an FP8 KV cache, plus a pruned variant with only 7 experts per MoE layer (gpt-oss-6.0b-specialized-all-prune...).