Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction

Jingling Yuan; Junhao Dong; Tian Zhang; Yujia Tong; Yunyang Wan; Yuxi Wang

arxiv: 2606.01850 · v1 · pith:H36ORTLFnew · submitted 2026-06-01 · 💻 cs.AI


Pith reviewed 2026-06-28 15:03 UTC · model grok-4.3

classification 💻 cs.AI

keywords: model compression, quantization, pruning, uncertainty quantification, conformal prediction, large language models, NLP evaluation


The pith

Compression frequently decouples accuracy from uncertainty in large language models

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks twelve large language models under quantization and pruning across five NLP tasks to determine whether these compression methods preserve the models' ability to quantify uncertainty. It applies conformal prediction to obtain a distribution-free uncertainty measure. The results show that accuracy and uncertainty often become decoupled after compression, that larger models absorb the induced uncertainty more effectively, and that uncertainty inflation tends to appear abruptly at a threshold rather than gradually. This matters because safety-critical applications rely on reliable uncertainty estimates, so accuracy-only tests may miss deployment risks.

Core claim

Using conformal prediction on compressed LLMs, the study finds that quantization and pruning frequently decouple accuracy from uncertainty, that larger models absorb compression-induced uncertainty far more effectively than smaller ones, and that uncertainty inflation is often threshold-like rather than gradual.

What carries the argument

Conformal prediction, which supplies a distribution-free measure of uncertainty for the outputs of quantized and pruned LLMs.
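A minimal sketch of split conformal prediction for a multiple-choice classifier may help make this concrete. It assumes the nonconformity score is 1 minus the probability the model assigns to a label, which is a common choice but is not confirmed here as the paper's own; the function names `conformal_threshold`, `prediction_set`, and `evaluate` are hypothetical.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Return the split-conformal score threshold q_hat.

    Assumed score (not confirmed by the paper): s = 1 - p(true label).
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: keep every label
    return scores[k - 1]

def prediction_set(probs, q_hat):
    """All labels whose score 1 - p does not exceed the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= q_hat]

def evaluate(test_probs, test_labels, q_hat):
    """Empirical coverage and mean prediction set size on held-out data."""
    sets = [prediction_set(p, q_hat) for p in test_probs]
    coverage = sum(y in s for s, y in zip(sets, test_labels)) / len(sets)
    mean_size = sum(len(s) for s in sets) / len(sets)
    return coverage, mean_size
```

Under this setup the mean prediction set size plays the role of the uncertainty measure: a model whose probabilities are less informative needs larger sets to reach the same target coverage.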

If this is right

Accuracy preservation does not guarantee preservation of uncertainty under compression.
Larger models are more resilient to compression effects on uncertainty than smaller models.
Uncertainty inflation tends to occur abruptly rather than gradually as compression increases.
Accuracy-only evaluation is insufficient for assessing deployment readiness of compressed LLMs.
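One rough way to distinguish abrupt from gradual inflation is to ask how much of the total rise in set size happens in a single step of the compression sweep. The sketch below is an illustrative heuristic, not a procedure taken from the paper; the function name `largest_step_share` and the example numbers are hypothetical.

```python
def largest_step_share(levels, set_sizes):
    """Return (share, level) for the largest single-step rise in set size.

    share = biggest increase between adjacent levels / total increase.
    A share near 1 suggests a threshold; a share near 1/(number of steps)
    suggests a gradual rise. This is a heuristic, not the paper's method.
    """
    if len(levels) != len(set_sizes) or len(levels) < 2:
        raise ValueError("need at least two matched (level, set size) points")
    steps = [b - a for a, b in zip(set_sizes, set_sizes[1:])]
    total = set_sizes[-1] - set_sizes[0]
    if total <= 0:
        return 0.0, None  # no net inflation to attribute
    biggest = max(range(len(steps)), key=lambda i: steps[i])
    return steps[biggest] / total, levels[biggest + 1]

# Made-up sweep over sparsity levels (percent), for illustration only.
share, level = largest_step_share([0, 10, 20, 30, 40, 50],
                                  [1.0, 1.0, 1.0, 1.0, 3.0, 3.0])
print(share, level)
```

Any cut-off for calling a curve "threshold-like" would be a free choice and should be reported alongside the result.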

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed threshold behavior implies that safe compression levels may need to be identified separately for each model size.
Uncertainty-aware checks could become part of standard compression pipelines beyond the accuracy focus used today.
The decoupling pattern may appear in other efficiency techniques such as distillation that were not tested here.

Load-bearing premise

Conformal prediction supplies a valid measure of uncertainty that remains applicable to the outputs of quantized and pruned LLMs.
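For reference, the standard split-conformal guarantee that this premise relies on can be written as follows. The notation is ours rather than the paper's: $n$ calibration points, target miscoverage $\alpha$, a fixed score function $s$, and threshold $\hat{q}$, assuming $\lceil (n+1)(1-\alpha) \rceil \le n$.

```latex
\[
\hat{q} = \text{the } \lceil (n+1)(1-\alpha) \rceil \text{-th smallest value among } s(X_1, Y_1), \dots, s(X_n, Y_n)
\]
\[
C(x) = \{\, y : s(x, y) \le \hat{q} \,\}
\]
\[
\mathbb{P}\bigl( Y_{n+1} \in C(X_{n+1}) \bigr) \ge 1 - \alpha
\]
```

The bound holds when the calibration and test points are exchangeable and the score function is fixed before calibration. Because it is marginal and applies to any fixed model, a compressed model calibrated on its own outputs would be expected to keep coverage and pay for it with larger sets, which is why set size is the natural quantity to compare.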

What would settle it

Finding that uncertainty measures remain aligned with accuracy across all tested compression levels and model sizes would challenge the decoupling result.

Figures

Figures reproduced from arXiv: 2606.01850 by Jingling Yuan, Junhao Dong, Tian Zhang, Yujia Tong, Yunyang Wan, Yuxi Wang.

**Figure 2.** Accuracy change (∆Acc) vs. uncertainty change (∆SS) relative to uncompressed baselines for five models across all compression methods and tasks. If accuracy and uncertainty were coupled, points would concentrate along a consistent negative-slope trend; instead, substantial dispersion is observed in all subplots, confirming that compression frequently decouples the two metrics (Finding 1). Comparing across …

**Figure 3.** Prediction set size (SS) as a function of Wanda pruning sparsity (0%–50%) across five tasks.
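The coupling check described in the Figure 2 caption can be approximated numerically: compute each configuration's change in accuracy and in set size relative to the baseline, then measure how strongly the two move together. The sketch below uses a plain Pearson correlation as one possible summary; the paper's own analysis is not specified here, and the helper names `deltas` and `pearson` are hypothetical.

```python
import math

def deltas(baseline, compressed):
    """Per-configuration change relative to the uncompressed baseline."""
    return [c - baseline for c in compressed]

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)
```

A correlation close to -1 between the accuracy deltas and the set-size deltas would match the coupled, negative-slope picture; values near zero would be consistent with the dispersion the caption describes.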

Original abstract

Model compression techniques such as quantization and pruning are widely used to reduce the deployment cost of large language models (LLMs), with existing evaluations focusing almost exclusively on accuracy preservation. However, in safety-critical applications, a model's ability to reliably quantify its own uncertainty is equally important. We ask: does compression preserve this ability? To answer this question, we benchmark 12 LLMs under various compression configurations across five NLP tasks, using conformal prediction to provide a rigorous, distribution-free measure of uncertainty. Our experiments reveal that: (I) compression frequently decouples accuracy from uncertainty; (II) larger models absorb compression-induced uncertainty far more effectively than smaller ones; and (III) uncertainty inflation is often threshold-like rather than gradual. These results suggest that accuracy-only evaluation is insufficient for assessing the deployment readiness of compressed LLMs, and that uncertainty-aware benchmarking should be a standard component of model compression pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This benchmark finds that compression often decouples accuracy from uncertainty in LLMs when measured by conformal prediction.


The main observation is that compression frequently separates a model's accuracy from its uncertainty estimates on these tasks. The authors run conformal prediction on 12 LLMs across five NLP tasks under quantization and pruning, and they report three patterns: the two metrics come apart, larger models absorb the uncertainty increase more readily, and the rise in uncertainty often looks threshold-like rather than gradual.

What the paper adds is a consistent application of conformal prediction as the uncertainty measure across multiple compression settings. Most prior compression work stops at accuracy or perplexity, so adding this check is a straightforward extension that highlights a practical gap.

The experiments cover enough models and tasks to make the pattern worth noticing. If the conformal coverage holds and the calibration sets are handled cleanly, the decoupling result gives a concrete reason to look beyond accuracy when certifying compressed models for safety-critical use.

The limitation is that the description supplies almost no information on how the conformal procedure was implemented, what the actual coverage rates were, or whether error bars or statistical tests accompany the claims. Without those pieces it is hard to judge the strength of the evidence. The finding stays suggestive until the details are checked.

This is for groups that already care about uncertainty quantification in deployed LLMs and want to see how compression interacts with it. A reader focused on evaluation methods will get a usable data point.

It deserves peer review because the question is timely and the empirical scope is wide enough to be worth referee time, even if the current writeup needs more on the method.

Referee Report

2 major / 2 minor

Summary. The paper benchmarks 12 LLMs under quantization and pruning on five NLP tasks, using conformal prediction to measure uncertainty. It reports that compression frequently decouples accuracy from uncertainty, larger models absorb compression-induced uncertainty more effectively, and uncertainty inflation tends to be threshold-like rather than gradual, concluding that accuracy-only evaluations are insufficient for assessing compressed LLMs in safety-critical settings.

Significance. If the empirical results hold under detailed scrutiny, the work provides a useful demonstration that uncertainty quantification can degrade independently of accuracy under compression. The model-agnostic nature of conformal prediction strengthens the benchmark approach, and the scale (12 models, 5 tasks) offers a broad view that could encourage uncertainty-aware practices in compression pipelines.

major comments (2)

[Experimental methodology] Experimental methodology section: the manuscript invokes conformal prediction as a 'rigorous, distribution-free measure' but supplies no description of the nonconformity score, calibration-set construction, or how exchangeability is maintained for the specific NLP tasks and compressed model outputs. This detail is load-bearing for validating claims (I)-(III) on decoupling and threshold behavior.
[Results] Results section (across the 12 models and 5 tasks): the reported patterns of decoupling and threshold-like inflation are presented without statistical tests, error bars, or sensitivity analysis to data splits. This undermines the strength of the qualifiers 'frequently' and 'often' in the abstract and conclusions.

minor comments (2)

[Abstract] The abstract and introduction refer to 'various compression configurations' without enumerating the specific bit-widths, sparsity ratios, or methods (e.g., GPTQ vs. AWQ) tested; adding this would improve interpretability of the threshold-like behavior.
[Figures/Tables] Figure and table captions should explicitly state the number of random seeds or runs used to generate the plotted uncertainty metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where additional methodological detail and statistical support would strengthen the manuscript. We will revise accordingly to address both major comments.


Referee: [Experimental methodology] Experimental methodology section: the manuscript invokes conformal prediction as a 'rigorous, distribution-free measure' but supplies no description of the nonconformity score, calibration-set construction, or how exchangeability is maintained for the specific NLP tasks and compressed model outputs. This detail is load-bearing for validating claims (I)-(III) on decoupling and threshold behavior.

Authors: We agree that the Experimental Methodology section requires explicit implementation details to support the claims. In the revised manuscript we will add a dedicated subsection specifying: the nonconformity score (1 minus the softmax probability assigned to the ground-truth label for the classification tasks used), the calibration-set construction (a fixed 20% random hold-out from each task's training split, kept disjoint from test data), and the exchangeability assumption (standard i.i.d. exchangeability within each task's data distribution, with a brief note on how compression does not alter this assumption for the purpose of coverage guarantees). These additions will directly bolster validation of claims (I)-(III).

Revision: yes

Referee: [Results] Results section (across the 12 models and 5 tasks): the reported patterns of decoupling and threshold-like inflation are presented without statistical tests, error bars, or sensitivity analysis to data splits. This undermines the strength of the qualifiers 'frequently' and 'often' in the abstract and conclusions.

Authors: We accept that the Results section would be strengthened by quantitative support for the reported patterns. In revision we will (i) add error bars derived from 5 independent random seeds for calibration/test splits, (ii) perform sensitivity analysis across three different calibration-set sizes, and (iii) include paired statistical tests (Wilcoxon signed-rank) comparing accuracy versus coverage gap under each compression level. These changes will provide empirical grounding for the qualifiers 'frequently' and 'often'.

Revision: yes
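The seed-level error bars and paired test proposed in this response could look roughly like the sketch below. It is a pure-Python illustration under our own assumptions: `seed_summary` and `wilcoxon_exact` are hypothetical names, the exact p-value is obtained by enumerating sign patterns (practical only for a handful of pairs), and the numbers in the usage lines are made up.

```python
from itertools import product
from statistics import mean, stdev

def seed_summary(values):
    """Mean and sample standard deviation across seeds (error-bar inputs)."""
    return mean(values), stdev(values)

def wilcoxon_exact(x, y):
    """Two-sided exact Wilcoxon signed-rank test for small paired samples.

    Zero differences are dropped; tied |differences| share average ranks.
    Returns (W_plus, p_value).
    """
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    centre = sum(ranks) / 2
    observed = abs(w_plus - centre)
    # Enumerate all 2**n sign patterns under the null of symmetric signs.
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed - 1e-12:
            hits += 1
    return w_plus, hits / 2 ** n

# Made-up set sizes for 5 seeds: compressed model vs. baseline.
compressed = [1.2, 1.5, 1.9, 2.4, 3.0]
baseline = [1.0, 1.1, 1.3, 1.5, 1.8]
print(seed_summary(compressed))
print(wilcoxon_exact(compressed, baseline))
```

One design point worth noting: with only 5 pairs, the smallest two-sided p-value this test can produce is 2/32 = 0.0625, so five seeds alone cannot reach the conventional 0.05 level; pairing across tasks or models as well as seeds would give the test more room.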

Circularity Check

0 steps flagged

No significant circularity identified


This is a purely empirical benchmark paper with no derivations, equations, fitted parameters, or load-bearing self-citations. It applies standard conformal prediction (a model-agnostic, distribution-free procedure) to measure uncertainty on compressed LLMs and reports experimental observations across 12 models and five tasks. The central claims (decoupling of accuracy from uncertainty, etc.) are direct outputs of the benchmark results rather than quantities defined by or reduced to the evaluation method itself. No step reduces a prediction or result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the domain assumption that conformal prediction yields valid uncertainty sets for LLM outputs after compression; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: Conformal prediction provides distribution-free uncertainty quantification valid for the tested LLM outputs.
Role: central premise enabling the uncertainty measurements reported in the abstract.



