A McNemar-based statistical test detects real degradations in optimized LLMs with controlled false positives, even for accuracy changes as small as 0.3%.
We therefore compare the 20B model against a rerun and against a version with FP8 KV cache.
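To make the comparison above concrete, here is a minimal sketch of a textbook one-sided exact McNemar test on paired per-example correctness. It is not necessarily the exact procedure used in the paper, which is described only as "McNemar-based". The function name `mcnemar_exact_one_sided` and the toy inputs are hypothetical and chosen for illustration.

```python
from math import comb


def mcnemar_exact_one_sided(baseline_correct, optimized_correct):
    """Textbook one-sided exact McNemar test for degradation (illustrative sketch).

    Both arguments are equal-length sequences of booleans: whether each
    benchmark example was answered correctly by the baseline model and by
    the optimized model. Returns (b, c, p_value).
    """
    if len(baseline_correct) != len(optimized_correct):
        raise ValueError("paired inputs must have the same length")

    pairs = list(zip(baseline_correct, optimized_correct))
    # Discordant pairs: only examples where the two models disagree matter.
    b = sum(1 for base, opt in pairs if base and not opt)  # baseline right, optimized wrong
    c = sum(1 for base, opt in pairs if opt and not base)  # baseline wrong, optimized right
    n = b + c
    if n == 0:
        # The models agree on every example, so there is no evidence of change.
        return b, c, 1.0

    # Under the null hypothesis each discordant pair is equally likely to
    # favor either model, so b ~ Binomial(n, 0.5). The one-sided p-value is
    # P(X >= b), the chance of at least this many losses for the optimized model.
    p_value = sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n
    return b, c, p_value


# Toy data (hypothetical): the optimized model loses 8 of 10 examples and gains none.
baseline = [True] * 10
optimized = [False] * 8 + [True] * 2
b, c, p = mcnemar_exact_one_sided(baseline, optimized)
print(f"b={b}, c={c}, one-sided p={p:.4f}")
```

Because the test uses only the examples on which the two runs disagree, comparing a model against its own rerun gives a reference point for how often the test fires when nothing has actually changed.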
When LLMs get significantly worse: A statistical approach to detect model degradations