pith. sign in

arxiv: 2606.01850 · v1 · pith:H36ORTLFnew · submitted 2026-06-01 · 💻 cs.AI

Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction

Pith reviewed 2026-06-28 15:03 UTC · model grok-4.3

classification 💻 cs.AI
keywords model compressionquantizationpruninguncertainty quantificationconformal predictionlarge language modelsNLP evaluation
0
0 comments X

The pith

Compression frequently decouples accuracy from uncertainty in large language models

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks twelve large language models under quantization and pruning across five NLP tasks to determine whether these compression methods preserve the models' ability to quantify uncertainty. It applies conformal prediction to obtain a distribution-free uncertainty measure. The results show that accuracy and uncertainty often become disconnected after compression, larger models absorb the induced uncertainty more effectively, and uncertainty inflation tends to occur in abrupt thresholds rather than gradually. This matters because safety-critical applications rely on reliable uncertainty estimates, so accuracy-only tests may miss deployment risks.

Core claim

Using conformal prediction on compressed LLMs, the study finds that quantization and pruning frequently decouple accuracy from uncertainty, that larger models absorb compression-induced uncertainty far more effectively than smaller ones, and that uncertainty inflation is often threshold-like rather than gradual.

What carries the argument

Conformal prediction, which supplies a distribution-free measure of uncertainty for the outputs of quantized and pruned LLMs.

If this is right

  • Accuracy preservation does not guarantee preservation of uncertainty under compression.
  • Larger models are more resilient to compression effects on uncertainty than smaller models.
  • Uncertainty inflation tends to occur abruptly rather than gradually as compression increases.
  • Accuracy-only evaluation is insufficient for assessing deployment readiness of compressed LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed threshold behavior implies that safe compression levels may need to be identified separately for each model size.
  • Uncertainty-aware checks could become part of standard compression pipelines beyond the accuracy focus used today.
  • The decoupling pattern may appear in other efficiency techniques such as distillation that were not tested here.

Load-bearing premise

Conformal prediction supplies a valid measure of uncertainty that remains applicable to the outputs of quantized and pruned LLMs.

What would settle it

Finding that uncertainty measures remain aligned with accuracy across all tested compression levels and model sizes would challenge the decoupling result.

Figures

Figures reproduced from arXiv: 2606.01850 by Jingling Yuan, Junhao Dong, Tian Zhang, Yujia Tong, Yunyang Wan, Yuxi Wang.

Figure 1
Figure 1. Figure 1: Overview of our benchmark. Left: two models with the same accuracy can exhibit very [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Accuracy change (∆Acc) vs. uncertainty change (∆SS) relative to uncompressed baselines for five models across all compression methods and tasks. If accuracy and uncertainty were coupled, points would concentrate along a consistent negative-slope trend; instead, substantial dispersion is observed in all subplots, confirming that compression frequently decouples the two metrics (Finding 1). Comparing across … view at source ↗
Figure 3
Figure 3. Figure 3: Prediction set size (SS) as a function of Wanda pruning sparsity (0%–50%) across five tasks [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
read the original abstract

Model compression techniques such as quantization and pruning are widely used to reduce the deployment cost of large language models (LLMs), with existing evaluations focusing almost exclusively on accuracy preservation. However, in safety-critical applications, a model's ability to reliably quantify its own uncertainty is equally important. We ask: does compression preserve this ability? To answer this question, we benchmark 12 LLMs under various compression configurations across five NLP tasks, using conformal prediction to provide a rigorous, distribution-free measure of uncertainty. Our experiments reveal that: (I) compression frequently decouples accuracy from uncertainty; (II) larger models absorb compression-induced uncertainty far more effectively than smaller ones; and (III) uncertainty inflation is often threshold-like rather than gradual. These results suggest that accuracy-only evaluation is insufficient for assessing the deployment readiness of compressed LLMs, and that uncertainty-aware benchmarking should be a standard component of model compression pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper benchmarks 12 LLMs under quantization and pruning on five NLP tasks, using conformal prediction to measure uncertainty. It reports that compression frequently decouples accuracy from uncertainty, larger models absorb compression-induced uncertainty more effectively, and uncertainty inflation tends to be threshold-like rather than gradual, concluding that accuracy-only evaluations are insufficient for assessing compressed LLMs in safety-critical settings.

Significance. If the empirical results hold under detailed scrutiny, the work provides a useful demonstration that uncertainty quantification can degrade independently of accuracy under compression. The model-agnostic nature of conformal prediction strengthens the benchmark approach, and the scale (12 models, 5 tasks) offers a broad view that could encourage uncertainty-aware practices in compression pipelines.

major comments (2)
  1. [Experimental methodology] Experimental methodology section: the manuscript invokes conformal prediction as a 'rigorous, distribution-free measure' but supplies no description of the nonconformity score, calibration-set construction, or how exchangeability is maintained for the specific NLP tasks and compressed model outputs. This detail is load-bearing for validating claims (I)-(III) on decoupling and threshold behavior.
  2. [Results] Results section (across the 12 models and 5 tasks): the reported patterns of decoupling and threshold-like inflation are presented without statistical tests, error bars, or sensitivity analysis to data splits. This undermines the strength of the qualifiers 'frequently' and 'often' in the abstract and conclusions.
minor comments (2)
  1. [Abstract] The abstract and introduction refer to 'various compression configurations' without enumerating the specific bit-widths, sparsity ratios, or methods (e.g., GPTQ vs. AWQ) tested; adding this would improve interpretability of the threshold-like behavior.
  2. [Figures/Tables] Figure and table captions should explicitly state the number of random seeds or runs used to generate the plotted uncertainty metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where additional methodological detail and statistical support would strengthen the manuscript. We will revise accordingly to address both major comments.

read point-by-point responses
  1. Referee: [Experimental methodology] Experimental methodology section: the manuscript invokes conformal prediction as a 'rigorous, distribution-free measure' but supplies no description of the nonconformity score, calibration-set construction, or how exchangeability is maintained for the specific NLP tasks and compressed model outputs. This detail is load-bearing for validating claims (I)-(III) on decoupling and threshold behavior.

    Authors: We agree that the Experimental Methodology section requires explicit implementation details to support the claims. In the revised manuscript we will add a dedicated subsection specifying: the nonconformity score (1 minus the softmax probability assigned to the ground-truth label for the classification tasks used), the calibration-set construction (a fixed 20% random hold-out from each task's training split, kept disjoint from test data), and the exchangeability assumption (standard i.i.d. exchangeability within each task's data distribution, with a brief note on how compression does not alter this assumption for the purpose of coverage guarantees). These additions will directly bolster validation of claims (I)-(III). revision: yes

  2. Referee: [Results] Results section (across the 12 models and 5 tasks): the reported patterns of decoupling and threshold-like inflation are presented without statistical tests, error bars, or sensitivity analysis to data splits. This undermines the strength of the qualifiers 'frequently' and 'often' in the abstract and conclusions.

    Authors: We accept that the Results section would be strengthened by quantitative support for the reported patterns. In revision we will (i) add error bars derived from 5 independent random seeds for calibration/test splits, (ii) perform sensitivity analysis across three different calibration-set sizes, and (iii) include paired statistical tests (Wilcoxon signed-rank) comparing accuracy versus coverage gap under each compression level. These changes will provide empirical grounding for the qualifiers 'frequently' and 'often'. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

This is a purely empirical benchmark paper with no derivations, equations, fitted parameters, or load-bearing self-citations. It applies standard conformal prediction (a model-agnostic, distribution-free procedure) to measure uncertainty on compressed LLMs and reports experimental observations across 12 models and five tasks. The central claims (decoupling of accuracy from uncertainty, etc.) are direct outputs of the benchmark results rather than quantities defined by or reduced to the evaluation method itself. No step reduces a prediction or result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The benchmark rests on the domain assumption that conformal prediction yields valid uncertainty sets for LLM outputs after compression; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Conformal prediction provides distribution-free uncertainty quantification valid for the tested LLM outputs
    Central premise enabling the uncertainty measurements reported in the abstract.

pith-pipeline@v0.9.1-grok · 5698 in / 1057 out tokens · 26664 ms · 2026-06-28T15:03:14.358256+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

65 extracted references · 18 linked inside Pith

  1. [1]

    The Llama 3 herd of models,

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughanet al., “The Llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024

  2. [2]

    Gpt-4 technical report,

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Hierarchically robust zero-shot vision- language models,

    J. Dong, Y . Zhang, H. Zhu, Y .-S. Ong, and P. Koniusz, “Hierarchically robust zero-shot vision- language models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2026, pp. 37 642–37 652

  4. [4]

    Tug-of-war no more: Harmonizing accuracy and robustness in vision-language models via stability-aware task vector merging,

    J. Dong, X. Qu, C. Zhang, S. Q. Rong, N. D. Thai, W. Pan, X. Li, T. Liu, P. Koniusz, and Y .-S. Ong, “Tug-of-war no more: Harmonizing accuracy and robustness in vision-language models via stability-aware task vector merging,” inThe Fourteenth International Conference on Learning Representations, 2026

  5. [5]

    Craft-lora: Content-style personalization via rank-constrained adaptation and training-free fusion,

    Y . Li, Y . Cai, and C. Zhang, “Craft-lora: Content-style personalization via rank-constrained adaptation and training-free fusion,”arXiv preprint arXiv:2602.18936, 2026

  6. [6]

    Reason in chains, learn in trees: Self-rectification and grafting for multi-turn agent policy optimization,

    Y . Li, S. Tang, and T. Lan, “Reason in chains, learn in trees: Self-rectification and grafting for multi-turn agent policy optimization,”arXiv preprint arXiv:2604.07165, 2026

  7. [7]

    Sage: Accelerating vision-language models via entropy-guided adaptive speculative decoding,

    Y . Tong, T. Zhang, Y . Wan, K. Lin, J. Yuan, and C. Hu, “Sage: Accelerating vision-language models via entropy-guided adaptive speculative decoding,”arXiv preprint arXiv:2602.00523, 2026

  8. [8]

    A survey on model compression for large language models,

    X. Zhu, J. Li, Y . Liu, C. Ma, and W. Wang, “A survey on model compression for large language models,”Transactions of the Association for Computational Linguistics, vol. 12, pp. 1556–1577, 2024

  9. [9]

    Optimal brain restoration for joint quantization and sparsification of llms,

    H. Guo, Y . Li, and L. Benini, “Optimal brain restoration for joint quantization and sparsification of llms,”arXiv preprint arXiv:2509.11177, 2025

  10. [10]

    Quarot: Outlier-free 4-bit inference in rotated llms,

    S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman, “Quarot: Outlier-free 4-bit inference in rotated llms,”Advances in Neural Information Processing Systems, vol. 37, pp. 100 213–100 240, 2024

  11. [11]

    Awq: Activation-aware weight quantization for on-device llm compression and acceleration,

    J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “Awq: Activation-aware weight quantization for on-device llm compression and acceleration,”Proceedings of machine learning and systems, vol. 6, pp. 87–100, 2024

  12. [12]

    Smoothquant: Accurate and efficient post-training quantization for large language models,

    G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “Smoothquant: Accurate and efficient post-training quantization for large language models,” inInternational conference on machine learning. PMLR, 2023, pp. 38 087–38 099

  13. [13]

    Gpt3. int8 (): 8-bit matrix multiplica- tion for transformers at scale,

    T. Dettmers, M. Lewis, Y . Belkada, and L. Zettlemoyer, “Gpt3. int8 (): 8-bit matrix multiplica- tion for transformers at scale,”Advances in neural information processing systems, vol. 35, pp. 30 318–30 332, 2022

  14. [14]

    Omniquant: Omnidirectionally calibrated quantization for large language models,

    W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, G. Peng, Y . Qiao, and P. Luo, “Omniquant: Omnidirectionally calibrated quantization for large language models,” inInterna- tional Conference on Learning Representations, vol. 2024, 2024, pp. 45 472–45 496

  15. [15]

    Slicegpt: Com- press large language models by deleting rows and columns,

    S. Ashkboos, M. L. Croci, M. G. d. Nascimento, T. Hoefler, and J. Hensman, “Slicegpt: Com- press large language models by deleting rows and columns,”arXiv preprint arXiv:2401.15024, 2024

  16. [16]

    Sparsegpt: Massive language models can be accurately pruned in one-shot,

    E. Frantar and D. Alistarh, “Sparsegpt: Massive language models can be accurately pruned in one-shot,” inInternational conference on machine learning. PMLR, 2023, pp. 10 323–10 337

  17. [17]

    Fluctuation-based adaptive structured pruning for large language models,

    Y . An, X. Zhao, T. Yu, M. Tang, and J. Wang, “Fluctuation-based adaptive structured pruning for large language models,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 10, 2024, pp. 10 865–10 873

  18. [18]

    Language models (mostly) know what they know,

    S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield- Dodds, N. DasSarma, E. Tran-Johnsonet al., “Language models (mostly) know what they know,”arXiv preprint arXiv:2207.05221, 2022. 10

  19. [19]

    A tutorial on conformal prediction

    G. Shafer and V . V ovk, “A tutorial on conformal prediction.”Journal of machine learning research, vol. 9, no. 3, 2008

  20. [20]

    Conformal prediction: A gentle introduction,

    A. N. Angelopoulos and S. Bates, “Conformal prediction: A gentle introduction,”Foundations and Trends in Machine Learning, vol. 16, no. 4, pp. 494–591, 2023

  21. [21]

    Benchmarking llms via uncertainty quantification,

    F. Ye, M. Yang, J. Pang, L. Wang, D. F. Wong, E. Yilmaz, S. Shi, and Z. Tu, “Benchmarking llms via uncertainty quantification,”Advances in Neural Information Processing Systems, vol. 37, pp. 15 356–15 385, 2024

  22. [22]

    Conformal language modeling,

    V . Quach, A. Fisch, T. Schuster, A. Yala, J. H. Sohn, T. Jaakkola, and R. Barzilay, “Conformal language modeling,” inInternational Conference on Learning Representations, vol. 2024, 2024, pp. 11 654–11 681

  23. [23]

    Large language model validity via enhanced conformal prediction methods,

    J. J. Cherian, I. Gibbs, and E. J. Candès, “Large language model validity via enhanced conformal prediction methods,”Advances in Neural Information Processing Systems, vol. 37, pp. 114 812– 114 842, 2024

  24. [24]

    Quantifying deep learning model uncertainty in conformal predic- tion,

    H. Karimi and R. Samavi, “Quantifying deep learning model uncertainty in conformal predic- tion,” inProceedings of the AAAI Symposium Series, vol. 1, no. 1, 2023, pp. 142–148

  25. [25]

    Robust machine unlearning for quantized neural networks via adaptive gradient reweighting with similar labels,

    Y . Tong, Y . Wang, J. Yuan, and C. Hu, “Robust machine unlearning for quantized neural networks via adaptive gradient reweighting with similar labels,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 20 603–20 612

  26. [26]

    Enhancing quantization-aware training on edge devices via relative entropy coreset selection and cascaded layer correction,

    Y . Tong, J. Yuan, and C. Hu, “Enhancing quantization-aware training on edge devices via relative entropy coreset selection and cascaded layer correction,”IEEE Transactions on Mobile Computing, 2026

  27. [27]

    Forget by uncertainty: Orthogonal entropy unlearning for quantized neural networks,

    T. Zhang, Y . Tong, J. Dong, K. Xu, Y . Wang, and J. Yuan, “Forget by uncertainty: Orthogonal entropy unlearning for quantized neural networks,”arXiv preprint arXiv:2602.00567, 2026

  28. [28]

    Data-free quantization of vision transformers via easy-to-hard synthesis and activation correction,

    Y . Tong, J. Yuan, T. Zhang, J. Liu, and C. Hu, “Data-free quantization of vision transformers via easy-to-hard synthesis and activation correction,”ACM Transactions on Multimedia Computing, Communications and Applications, 2025

  29. [29]

    A simple and effective pruning approach for large language models,

    M. Sun, Z. Liu, A. Bair, and J. Z. Kolter, “A simple and effective pruning approach for large language models,”arXiv preprint arXiv:2306.11695, 2023

  30. [30]

    Gptq: Accurate post-training quantization for generative pre-trained transformers,

    E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “Gptq: Accurate post-training quantization for generative pre-trained transformers,”arXiv preprint arXiv:2210.17323, 2022

  31. [31]

    Spinquant: Llm quantization with learned rotations,

    Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V . Chandra, Y . Tian, and T. Blankevoort, “Spinquant: Llm quantization with learned rotations,”arXiv preprint arXiv:2405.16406, 2024

  32. [32]

    On the convergence of muon and beyond,

    D. Chang, Y . Liu, and G. Yuan, “On the convergence of muon and beyond,”arXiv preprint arXiv:2509.15816, 2025

  33. [33]

    Muoneq: Balancing before orthogonalization with lightweight equilibration,

    D. Chang, Q. Shi, L. Zhang, Y . Li, R. Zhang, Y . Lu, Y . Liu, and G. Yuan, “Muoneq: Balancing before orthogonalization with lightweight equilibration,”arXiv preprint arXiv:2603.28254, 2026

  34. [34]

    Mgup: A momentum-gradient alignment update policy for stochastic optimization,

    D. Chang and G. Yuan, “Mgup: A momentum-gradient alignment update policy for stochastic optimization,”Advances in Neural Information Processing Systems, vol. 38, pp. 20 488–20 537, 2026

  35. [35]

    Flatquant: Flatness matters for llm quantization,

    Y . Sun, R. Liu, H. Bai, H. Bao, K. Zhao, Y . Li, J. Hu, X. Yu, L. Hou, C. Yuanet al., “Flatquant: Flatness matters for llm quantization,” inInternational Conference on Machine Learning. PMLR, 2025, pp. 57 587–57 613

  36. [36]

    Qlora: Efficient finetuning of quantized llms,

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,”Advances in neural information processing systems, vol. 36, pp. 10 088–10 115, 2023

  37. [37]

    Atom: Low-bit quantization for efficient and accurate llm serving,

    Y . Zhao, C.-Y . Lin, K. Zhu, Z. Ye, L. Chen, S. Zheng, L. Ceze, A. Krishnamurthy, T. Chen, and B. Kasikci, “Atom: Low-bit quantization for efficient and accurate llm serving,”Proceedings of Machine Learning and Systems, vol. 6, pp. 196–209, 2024

  38. [38]

    Llm-pruner: On the structural pruning of large language models,

    X. Ma, G. Fang, and X. Wang, “Llm-pruner: On the structural pruning of large language models,” Advances in neural information processing systems, vol. 36, pp. 21 702–21 720, 2023. 11

  39. [39]

    On verbalized confidence scores for llms,

    D. Yang, Y .-H. H. Tsai, and M. Yamada, “On verbalized confidence scores for llms,”arXiv preprint arXiv:2412.14737, 2024

  40. [40]

    Ensemble based systems in decision making,

    R. Polikar, “Ensemble based systems in decision making,”IEEE Circuits and systems magazine, vol. 6, no. 3, pp. 21–45, 2006

  41. [41]

    Simple and scalable predictive uncertainty estimation using deep ensembles,

    B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive uncertainty estimation using deep ensembles,”Advances in neural information processing systems, vol. 30, 2017

  42. [42]

    Probabilistic forecasting using monte carlo dropout neural networks,

    C. Serpell, I. Araya, C. Valle, and H. Allende, “Probabilistic forecasting using monte carlo dropout neural networks,” inIberoamerican congress on pattern recognition. Springer, 2019, pp. 387–397

  43. [43]

    Dropout as a bayesian approximation: Representing model uncer- tainty in deep learning,

    Y . Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncer- tainty in deep learning,” ininternational conference on machine learning. PMLR, 2016, pp. 1050–1059

  44. [44]

    Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping,

    S. Chen, Y . Guo, Y . Ye, S. Huang, W. Hu, H. Li, M. Zhang, J. Chen, S. Guo, and N. Peng, “Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping,”arXiv preprint arXiv:2510.08457, 2025

  45. [45]

    Evidential deep learning to quantify classification uncertainty,

    M. Sensoy, L. Kaplan, and M. Kandemir, “Evidential deep learning to quantify classification uncertainty,”Advances in neural information processing systems, vol. 31, 2018

  46. [46]

    Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation,

    L. Kuhn, Y . Gal, and S. Farquhar, “Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation,”arXiv preprint arXiv:2302.09664, 2023

  47. [47]

    On calibration of modern neural networks,

    C. Guo, G. Pleiss, Y . Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” inInternational conference on machine learning. PMLR, 2017, pp. 1321–1330

  48. [48]

    Revisiting the calibration of modern neural networks,

    M. Minderer, J. Djolonga, R. Romijnders, F. Hubis, X. Zhai, N. Houlsby, D. Tran, and M. Lu- cic, “Revisiting the calibration of modern neural networks,”Advances in neural information processing systems, vol. 34, pp. 15 682–15 694, 2021

  49. [49]

    Copu: Conformal prediction for uncertainty quantification in natural language generation,

    S. Wang, Y . Jiang, Y . Tang, L. Cheng, and H. Chen, “Copu: Conformal prediction for uncertainty quantification in natural language generation,”arXiv preprint arXiv:2502.12601, 2025

  50. [50]

    Conformal prediction with large language models for multi-choice question answering,

    B. Kumar, C. Lu, G. Gupta, A. Palepu, D. Bellamy, R. Raskar, and A. Beam, “Conformal prediction with large language models for multi-choice question answering,”arXiv preprint arXiv:2305.18404, 2023

  51. [51]

    Non-exchangeable conformal language generation with nearest neighbors,

    D. Ulmer, C. Zerva, and A. F. Martins, “Non-exchangeable conformal language generation with nearest neighbors,” inFindings of the Association for Computational Linguistics: EACL 2024, 2024, pp. 1909–1929

  52. [52]

    Uncertainty quantification and confidence calibration in large language models: A survey,

    X. Liu, T. Chen, L. Da, C. Chen, Z. Lin, and H. Wei, “Uncertainty quantification and confidence calibration in large language models: A survey,” inProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, 2025, pp. 6107–6117

  53. [53]

    Quantized can still be calibrated: A unified framework to calibration in quantized large language models,

    M. Zhong, G. Wang, Y .-N. Chuang, and N. Zou, “Quantized can still be calibrated: A unified framework to calibration in quantized large language models,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 30 503–30 517

  54. [54]

    Least ambiguous set-valued classifiers with bounded error levels,

    M. Sadinle, J. Lei, and L. Wasserman, “Least ambiguous set-valued classifiers with bounded error levels,”Journal of the American Statistical Association, vol. 114, no. 525, pp. 223–234, 2019

  55. [55]

    Classification with valid and adaptive coverage,

    Y . Romano, M. Sesia, and E. Candes, “Classification with valid and adaptive coverage,”Ad- vances in neural information processing systems, vol. 33, pp. 3581–3591, 2020

  56. [56]

    Measuring massive multitask language understanding,

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,”arXiv preprint arXiv:2009.03300, 2020

  57. [57]

    Cosmos qa: Machine reading compre- hension with contextual commonsense reasoning,

    L. Huang, R. Le Bras, C. Bhagavatula, and Y . Choi, “Cosmos qa: Machine reading compre- hension with contextual commonsense reasoning,” inProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), 2019, pp. 2391–2401. 12

  58. [58]

    Hellaswag: Can a machine really finish your sentence?

    R. Zellers, A. Holtzman, Y . Bisk, A. Farhadi, and Y . Choi, “Hellaswag: Can a machine really finish your sentence?” inProceedings of the 57th annual meeting of the association for computational linguistics, 2019, pp. 4791–4800

  59. [59]

    Halueval: A large-scale hallucination evaluation benchmark for large language models,

    J. Li, X. Cheng, X. Zhao, J.-Y . Nie, and J.-R. Wen, “Halueval: A large-scale hallucination evaluation benchmark for large language models,” inProceedings of the 2023 conference on empirical methods in natural language processing, 2023, pp. 6449–6464

  60. [60]

    Llama 2: Open foundation and fine-tuned chat models,

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosaleet al., “Llama 2: Open foundation and fine-tuned chat models,”arXiv preprint arXiv:2307.09288, 2023

  61. [61]

    Qwen3 technical report,

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lvet al., “Qwen3 technical report,”arXiv preprint arXiv:2505.09388, 2025

  62. [62]

    Deepseek llm: Scaling open-source language models with longtermism,

    X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu et al., “Deepseek llm: Scaling open-source language models with longtermism,”arXiv preprint arXiv:2401.02954, 2024

  63. [63]

    The falcon series of open language models,

    E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, É. Goffinet, D. Hesslow, J. Launay, Q. Malarticet al., “The falcon series of open language models,”arXiv preprint arXiv:2311.16867, 2023

  64. [64]

    Holistic evaluation of language models,

    P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y . Zhang, D. Narayanan, Y . Wu, A. Kumaret al., “Holistic evaluation of language models,”arXiv preprint arXiv:2211.09110, 2022

  65. [65]

    A survey on evaluation of large language models,

    Y . Chang, X. Wang, J. Wang, Y . Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y . Wanget al., “A survey on evaluation of large language models,”ACM transactions on intelligent systems and technology, vol. 15, no. 3, pp. 1–45, 2024. 13 A Appendix A.1 Prompting Strategies Following [21], we evaluate all models with prompt-based inference rather than task-s...