pith. machine review for the scientific record.

arxiv: 2604.08627 · v1 · submitted 2026-04-09 · 💻 cs.LG · cs.AI

Recognition: unknown

Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Post-hoc Uncertainty Estimation


Pith reviewed 2026-05-10 16:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords evidential deep learning · post-hoc uncertainty estimation · pretrained models · Dirichlet distribution · affine transformation · image classification · language model QA · out-of-distribution detection

The pith

A lightweight post-hoc module turns any pretrained model into an evidential model by learning a sample-dependent affine transform on its logits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Pretrained models deliver predictions but rarely supply trustworthy uncertainty estimates. Full retraining with evidential methods is costly, and existing post-hoc fixes often fall short on out-of-distribution data. The Evidential Transformation Network inserts a small trainable layer after a frozen pretrained network. This layer applies an affine transformation to the logits that depends on the input sample and treats the resulting values as concentration parameters of a Dirichlet distribution. Experiments across image classification and language-model question answering show improved uncertainty calibration while accuracy stays intact and added compute remains negligible.

Core claim

The Evidential Transformation Network converts a pretrained classifier into an evidential model by learning a sample-dependent affine transformation of the logits and interpreting the transformed outputs directly as the parameters of a Dirichlet distribution, thereby enabling reliable uncertainty estimation for both in-distribution and out-of-distribution inputs without access to internal model states or full retraining.

What carries the argument

The sample-dependent affine transformation applied to logits, which produces the concentration parameters of a Dirichlet distribution for uncertainty quantification.
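The mechanism reduces to a few lines. The sketch below is a minimal illustration, not the paper's implementation: the scalar scale-and-shift parameterization, the softplus positivity map, and the `toy_transform_net` heuristic are all assumptions made for the example.

```python
import math

def etn_forward(logits, transform_net):
    """Sketch of an ETN-style post-hoc step (hypothetical parameterization):
    a small trainable network predicts a per-sample scale a and shift b,
    the frozen model's logits are mapped z -> a*z + b, and the result is
    made positive to serve as Dirichlet concentration parameters."""
    a, b = transform_net(logits)               # sample-dependent affine parameters
    transformed = [a * z + b for z in logits]
    # softplus(+1) keeps every concentration strictly positive
    alpha = [math.log1p(math.exp(t)) + 1.0 for t in transformed]
    strength = sum(alpha)                      # Dirichlet strength S = sum(alpha)
    probs = [ai / strength for ai in alpha]    # predictive mean of the Dirichlet
    vacuity = len(alpha) / strength            # EDL-style uncertainty: K / S
    return probs, vacuity

# toy stand-in for the trainable module: peaked logits get a larger scale
def toy_transform_net(logits):
    margin = max(logits) - sorted(logits)[-2]
    return 1.0 + margin, 0.0

confident = [4.0, 0.1, -0.3]
ambiguous = [0.6, 0.5, 0.4]
_, u_conf = etn_forward(confident, toy_transform_net)
_, u_amb = etn_forward(ambiguous, toy_transform_net)
print(u_conf < u_amb)  # True: peaked logits -> more evidence -> lower vacuity
```

The key property the sketch preserves is that uncertainty is read off the Dirichlet strength rather than from a single softmax vector.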

If this is right

  • Any pretrained image or language classifier can receive evidential uncertainty estimates without modifying its weights or architecture.
  • Accuracy on the original task is preserved because the base model remains frozen during ETN training.
  • Computational cost stays low because only a lightweight module is added at inference time.
  • The same procedure applies across vision and language benchmarks under both in-distribution and out-of-distribution conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The success of a simple affine map on logits implies that much of the information needed for evidential uncertainty can be recovered from output logits without internal activations.
  • Practitioners could retrofit existing deployed models with ETN to support safer rejection or deferral decisions in high-stakes settings.
  • The approach might extend to other output distributions beyond Dirichlet if analogous lightweight transformations prove effective.

Load-bearing premise

A learned sample-dependent affine transformation of the logits alone is sufficient to yield Dirichlet parameters that accurately quantify uncertainty for both in-distribution and out-of-distribution cases.

What would settle it

The claim would be undermined if ETN, applied to a held-out pretrained model on a new out-of-distribution benchmark, produced uncertainty scores whose correlation with actual errors is no better than temperature scaling or other logit-only baselines.
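That falsification test can be made concrete with a rank-based AUROC over error indicators. The scores below are invented placeholders; only the comparison procedure, not the numbers, is the point.

```python
def auroc(scores, labels):
    """Probability that a randomly chosen error (label 1) receives a higher
    uncertainty score than a randomly chosen correct prediction (label 0);
    ties count 0.5. This is the rank-statistic form of AUROC."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# hypothetical uncertainty scores for the same predictions (1 = model erred)
errors      = [0, 0, 1, 0, 1, 1, 0, 1]
etn_scores  = [0.1, 0.2, 0.8, 0.3, 0.7, 0.9, 0.2, 0.6]   # e.g. ETN vacuity
temp_scores = [0.2, 0.1, 0.5, 0.4, 0.6, 0.4, 0.3, 0.7]   # e.g. temp-scaled entropy

print(auroc(etn_scores, errors), auroc(temp_scores, errors))
```

If ETN's AUROC against errors did not exceed the temperature-scaling baseline on such held-out data, the load-bearing premise would fail the test.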

Figures

Figures reproduced from arXiv: 2604.08627 by Chanhee Park, Heuiseok Lim, Jaehyung Seo, Jeongho Yoon, Yongchan Chun.

Figure 1. Comparison of average uncertainty estimation performance.

Figure 2. Comparison of uncertainty estimation performance based on different dimensions of the transformation parameter.

Figure 3. Comparison of uncertainty estimation performance based on different transformation methods.

Figure 4. Ablation studies on AUPR scores under different parameters of the prior distribution (left) and different numbers of MC samples.

Figure 5. Comparison of uncertainty estimation performance based on different transformation methods on CIFAR-10 and OBQA.

Figure 6. Comparison of uncertainty estimation performance and accuracy across different dimensionalities of the transformation parameter.

Figure 7. Histograms of logit margins for models trained with EDL.

Figure 8. Comparison of uncertainty estimation performance and accuracy across different dimensionalities of the transformation parameter.

Figure 9. Comparison of uncertainty estimation performance and accuracy across different dimensionalities of the transformation parameter.
Original abstract

Pretrained models have become standard in both vision and language, yet they typically do not provide reliable measures of confidence. Existing uncertainty estimation methods, such as deep ensembles and MC dropout, are often too computationally expensive to deploy in practice. Evidential Deep Learning (EDL) offers a more efficient alternative, but it requires models to be trained to output evidential quantities from the start, which is rarely true for pretrained networks. To enable EDL-style uncertainty estimation in pretrained models, we propose the Evidential Transformation Network (ETN), a lightweight post-hoc module that converts a pretrained predictor into an evidential model. ETN operates in logit space: it learns a sample-dependent affine transformation of the logits and interprets the transformed outputs as parameters of a Dirichlet distribution for uncertainty estimation. We evaluate ETN on image classification and large language model question-answering benchmarks under both in-distribution and out-of-distribution settings. ETN consistently improves uncertainty estimation over post-hoc baselines while preserving accuracy and adding only minimal computational overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Evidential Transformation Network (ETN), a lightweight post-hoc module that applies a learned sample-dependent affine transformation to the logits of a pretrained model and interprets the transformed values as concentration parameters of a Dirichlet distribution. This enables evidential-style uncertainty estimation for both in-distribution and out-of-distribution inputs on image classification and LLM question-answering benchmarks without retraining the base model or accessing internal activations. The central claim is that ETN improves uncertainty metrics over post-hoc baselines while preserving accuracy and incurring only minimal overhead.

Significance. If the empirical gains hold under rigorous validation, ETN would provide a practical, low-cost route to reliable uncertainty quantification for deployed pretrained models in vision and language, filling a gap between expensive methods like ensembles and the limitations of standard post-hoc calibration. The post-hoc, logit-only design is a strength for compatibility with existing networks.

major comments (2)
  1. [Method (ETN definition and Dirichlet interpretation)] The load-bearing assumption that a per-sample affine transform in logit space alone can produce trustworthy Dirichlet parameters for OOD uncertainty (without internal states or retraining) is not adequately supported. When OOD and ID logit distributions overlap—a common regime—the transform has no additional signal to differentiate evidence, yet the paper treats the resulting Dirichlet as reliable for both regimes. An ablation or analysis demonstrating robustness in overlapping-logit cases is required.
  2. [Abstract and Evaluation] No quantitative results, specific metrics (e.g., AUROC, ECE), training details for the ETN parameters, loss functions, or error bars appear in the abstract or high-level description, making it impossible to verify whether the data support the claim of consistent improvement. The full evaluation section must supply these with statistical significance tests.
minor comments (2)
  1. [Method] Clarify the exact parameterization of the sample-dependent affine transform (e.g., whether scale and shift are class-specific or shared, and how they are optimized).
  2. [Experiments] Add a direct comparison table against recent logit-based post-hoc methods (e.g., temperature scaling variants or Dirichlet calibration) to strengthen the baseline claims.
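For reference, the ECE metric the referee asks for is straightforward to compute. This is the standard equal-width-binned estimator, not anything taken from the paper; the toy confidences are invented.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: the bin-weighted mean of
    |accuracy - average confidence| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            ece += (len(b) / n) * abs(acc - avg_conf)
    return ece

# mildly miscalibrated: mean confidence 0.9 vs accuracy 0.8
print(expected_calibration_error([0.9] * 5, [1, 1, 1, 1, 0]))
# overconfident: mean confidence 0.95 vs accuracy 0.5
print(expected_calibration_error([0.95] * 4, [1, 0, 0, 1]))
```

Reporting this alongside AUROC, with error bars over seeds, would directly address the major comment above.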

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below and describe the revisions we will implement to improve the manuscript.

Point-by-point responses
  1. Referee: [Method (ETN definition and Dirichlet interpretation)] The load-bearing assumption that a per-sample affine transform in logit space alone can produce trustworthy Dirichlet parameters for OOD uncertainty (without internal states or retraining) is not adequately supported. When OOD and ID logit distributions overlap—a common regime—the transform has no additional signal to differentiate evidence, yet the paper treats the resulting Dirichlet as reliable for both regimes. An ablation or analysis demonstrating robustness in overlapping-logit cases is required.

    Authors: We appreciate the referee's emphasis on this core assumption. The ETN predicts sample-specific affine parameters via a lightweight network operating on the input (consistent with its post-hoc but input-aware design), which in principle allows it to modulate evidence assignment even when raw logit vectors exhibit overlap. Our empirical results on standard ID/OOD benchmarks demonstrate improved uncertainty metrics, indicating that sufficient differentiating signal is captured in practice. Nevertheless, we agree that dedicated analysis of the overlapping-logit regime is needed. In the revision we will add an ablation that (i) quantifies logit overlap between ID and OOD samples, (ii) visualizes the corresponding Dirichlet parameters and uncertainty estimates, and (iii) reports performance relative to baselines under high-overlap conditions. This analysis will be placed in the experimental section. revision: yes

  2. Referee: [Abstract and Evaluation] No quantitative results, specific metrics (e.g., AUROC, ECE), training details for the ETN parameters, loss functions, or error bars appear in the abstract or high-level description, making it impossible to verify whether the data support the claim of consistent improvement. The full evaluation section must supply these with statistical significance tests.

    Authors: We agree that greater quantitative transparency is warranted. While abstracts are conventionally concise, we will revise the abstract to explicitly state the key improvements (e.g., AUROC gains for OOD detection and ECE reductions). In the evaluation section we will add: (i) explicit training details for ETN (optimizer, learning rate, number of epochs, and the evidential loss formulation), (ii) error bars computed over multiple random seeds, and (iii) statistical significance tests (paired t-tests or Wilcoxon signed-rank tests) comparing ETN against each baseline. These additions will directly support the claim of consistent improvement. revision: yes
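The overlap quantification promised in point (i) of the first response could be done with something as simple as a histogram overlap coefficient on max-logit scores. The sketch and its synthetic Gaussian scores are illustrative assumptions, not the authors' protocol.

```python
import random

def overlap_coefficient(id_scores, ood_scores, n_bins=20):
    """Histogram overlap of two score distributions over their joint range:
    1.0 means indistinguishable histograms, 0.0 means disjoint support."""
    lo = min(min(id_scores), min(ood_scores))
    hi = max(max(id_scores), max(ood_scores))
    width = (hi - lo) / n_bins or 1.0

    def hist(xs):
        h = [0] * n_bins
        for x in xs:
            h[min(int((x - lo) / width), n_bins - 1)] += 1
        return [c / len(xs) for c in h]

    h_id, h_ood = hist(id_scores), hist(ood_scores)
    return sum(min(a, b) for a, b in zip(h_id, h_ood))

# synthetic max-logits: a well-separated ID/OOD regime
random.seed(0)
id_max  = [random.gauss(6.0, 1.0) for _ in range(1000)]
ood_max = [random.gauss(3.0, 1.0) for _ in range(1000)]
print(overlap_coefficient(id_max, ood_max))  # low overlap: the easy regime
```

Binning benchmark pairs by this coefficient and reporting uncertainty metrics per bin would isolate exactly the high-overlap regime the referee worries about.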

Circularity Check

0 steps flagged

No circularity: ETN is an independent post-hoc module trained and evaluated on external benchmarks

full rationale

The paper proposes ETN as a lightweight, separately trained module that applies a learned sample-dependent affine transform to the logits of a frozen pretrained model and treats the outputs as Dirichlet concentration parameters. This construction is defined explicitly as an add-on component with its own parameters optimized on held-out data; no equation reduces the claimed uncertainty estimates to the pretrained model's outputs by definition, and no central premise is justified solely by self-citation. All reported improvements are measured against standard external ID/OOD benchmarks rather than internal consistency checks, so the derivation chain remains self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of a learned affine transformation in logit space and the interpretation of its outputs as Dirichlet parameters; these elements are introduced by the paper rather than derived from prior results.

free parameters (1)
  • parameters of the sample-dependent affine transformation
    These are learned during post-hoc training of the ETN module on the target data.
axioms (1)
  • domain assumption: transformed logits can be directly interpreted as parameters of a Dirichlet distribution for uncertainty estimation
    This interpretation is invoked as the core of the ETN method in the abstract.
invented entities (1)
  • Evidential Transformation Network (ETN) (no independent evidence)
    purpose: lightweight post-hoc conversion of pretrained predictors into evidential models
    New module proposed in the paper, with no independent external evidence provided in the abstract.
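One reason the domain assumption is plausible: under an exponential link from transformed logits to concentrations (an assumed link here; the paper may use a different positivity map), the Dirichlet mean coincides exactly with the softmax of the transformed logits, so the argmax prediction of the base pipeline is unchanged.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

# If transformed logits t are exponentiated into concentrations alpha = exp(t),
# the Dirichlet mean alpha / sum(alpha) equals softmax(t): same predicted class,
# but the magnitude of alpha now carries an evidence (uncertainty) signal.
t = [2.0, 0.5, -1.0]
alpha = [math.exp(x) for x in t]
s = sum(alpha)
dirichlet_mean = [a / s for a in alpha]
print(all(abs(p - q) < 1e-12 for p, q in zip(dirichlet_mean, softmax(t))))
```

This identity is what lets a post-hoc evidential layer add uncertainty without touching accuracy, at least when the transform leaves the logit ordering intact.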

pith-pipeline@v0.9.0 · 5488 in / 1439 out tokens · 58509 ms · 2026-05-10T16:50:22.344662+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rethinking Vacuity for OOD Detection in Evidential Deep Learning

    cs.AI 2026-05 accept novelty 7.0

    Vacuity-based OOD detection in evidential deep learning is highly sensitive to class cardinality differences between ID and OOD, which can artificially inflate AUROC and AUPR without any change in model predictions.

Reference graph

Works this paper leans on

60 extracted references · 6 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  [1] Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining neural scaling laws. Proceedings of the National Academy of Sciences, 121(27), 2024.

  [2] Guy Bar-Shalom, Fabrizio Frasca, Derek Lim, Yoav Gelberg, Yftah Ziser, Ran El-Yaniv, Gal Chechik, and Haggai Maron. Beyond next token probabilities: Learnable, fast detection of hallucinations and data contamination on LLM output distributions, 2025.

  [3] Viktor Bengs, Eyke Hüllermeier, and Willem Waegeman. On second-order scoring rules for epistemic uncertainty quantification, 2023.

  [4] Bertrand Charpentier, Daniel Zügner, and Stephan Günnemann. Posterior network: Uncertainty estimation without OOD samples via density-based pseudo-counts. Advances in Neural Information Processing Systems, 33:1356–1367, 2020.

  [5] Mengyuan Chen, Junyu Gao, and Changsheng Xu. R-EDL: Relaxing nonessential settings of evidential deep learning. In The Twelfth International Conference on Learning Representations, 2024.

  [6] Wenhu Chen, Yilin Shen, Hongxia Jin, and William Wang. A variational Dirichlet framework for out-of-distribution detection. arXiv preprint arXiv:1811.07308, 2018.

  [7] Erik Daxberger, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, and Philipp Hennig. Laplace redux: effortless Bayesian deep learning. Advances in Neural Information Processing Systems, 34:20089–20103, 2021.

  [8] Danruo Deng, Guangyong Chen, Yang Yu, Furui Liu, and Pheng-Ann Heng. Uncertainty estimation by Fisher information-based evidential deep learning, 2023.

  [9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.

  [10] M.J. Evans and J.S. Rosenthal. Probability and Statistics: The Science of Uncertainty. W. H. Freeman, 2004.

  [11] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of The 33rd International Conference on Machine Learning, pages 1050–1059, New York, New York, USA, 2016. PMLR.

  [12] Jakob Gawlikowski, Cedrique Rovile Njieutcheu Tassi, Mohsin Ali, Jongseok Lee, Matthias Humt, Jianxiang Feng, Anna Kruspe, Rudolph Triebel, Peter Jung, Ribana Roscher, et al. A survey of uncertainty in deep neural networks. Artificial Intelligence Review, 56(Suppl 1):1513–1589, 2023.

  [13] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  [14] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. PMLR, 2017.

  [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.

  [16] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.

  [17] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8349, 2021.

  [18] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.

  [19] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15262–15271, 2021.

  [20] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically, 2017.

  [21] Gaurush Hiranandani, Haolun Wu, Subhojyoti Mukherjee, and Sanmi Koyejo. Logits are all we need to adapt closed models, 2025.

  [22] Taejong Joo, Uijung Chung, and Min-Gwan Seo. Being Bayesian about categorical probability. In International Conference on Machine Learning, pages 4950–4961. PMLR, 2020.

  [23] Audun Jøsang. Subjective Logic: A Formalism for Reasoning Under Uncertainty. Springer Publishing Company, Incorporated, 1st edition, 2016.

  [24] Tom Joy, Francesco Pinto, Ser-Nam Lim, Philip HS Torr, and Puneet K Dokania. Sample-dependent adaptive temperature scaling for improved calibration. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 14919–14926.

  [25] Mira Juergens, Nis Meinert, Viktor Bengs, Eyke Hüllermeier, and Willem Waegeman. Is epistemic uncertainty faithfully represented by evidential deep learning methods? In International Conference on Machine Learning, pages 22624–22642. PMLR, 2024.

  [26] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020.

  [27] Alex Krizhevsky. Learning multiple layers of features from tiny images. University of Toronto, 2012.

  [28] Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. Advances in Neural Information Processing Systems, 32, 2019.

  [29] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark, 2017. Association for Computational Linguistics.

  [30] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.

  [31] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in Neural Information Processing Systems, 31, 2018.

  [32] Yawei Li, David Rügamer, Bernd Bischl, and Mina Rezaei. Calibrating LLMs with information-theoretic evidential deep learning. arXiv preprint arXiv:2502.06351, 2025.

  [33] Shiyu Liang, Yixuan Li, and R Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations, 2018.

  [34] Jeremiah Liu, Zi Lin, Shreyas Padhy, Dustin Tran, Tania Bedrax Weiss, and Balaji Lakshminarayanan. Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. Advances in Neural Information Processing Systems, 33:7498–7512, 2020.

  [35] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. arXiv preprint arXiv:1612.02295, 2016.

  [36] Andrey Malinin and Mark Gales. Predictive uncertainty estimation via prior networks. Advances in Neural Information Processing Systems, 31, 2018.

  [37] Andrey Malinin and Mark Gales. Reverse KL-divergence training of prior networks: Improved uncertainty and adversarial robustness. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019.

  [38] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP, 2018.

  [39] Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, and Mario Lucic. Revisiting the calibration of modern neural networks. Advances in Neural Information Processing Systems, 34:15682–15694, 2021.

  [40] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

  [41] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632, 2005.

  [42] Deep Pandey and Qi Yu. Learn to accumulate evidence from all training samples: Theory and practice. In Proceedings of the 40th International Conference on Machine Learning, pages 26963–26989, 2023.

  [43] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.

  [44] Murat Sensoy, Lance Kaplan, and Melih Kandemir. Evidential deep learning to quantify classification uncertainty, 2018.

  [45] Maohao Shen, Yuheng Bu, Prasanna Sattigeri, Soumya Ghosh, Subhro Das, and Gregory Wornell. Post-hoc uncertainty learning using a Dirichlet meta-model, 2022.

  [46] Maohao Shen, Subhro Das, Kristjan Greenewald, Prasanna Sattigeri, Gregory Wornell, and Soumya Ghosh. Thermometer: Towards universal calibration for large language models. In Proceedings of the 41st International Conference on Machine Learning. JMLR.org, 2024.

  [47] Maohao Shen, Jongha Jon Ryu, Soumya Ghosh, Yuheng Bu, Prasanna Sattigeri, Subhro Das, and Gregory Wornell. Are uncertainty quantification capabilities of evidential deep learning a mirage? Advances in Neural Information Processing Systems, 37:107830–107864, 2024.

  [48] K Simonyan and A Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations (ICLR 2015). Computational and Biological Learning Society, 2015.

  [49] Johan AK Suykens and Joos Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.

  [50] Joost Van Amersfoort, Lewis Smith, Yee Whye Teh, and Yarin Gal. Uncertainty estimation using a single deep deterministic neural network. In International Conference on Machine Learning, pages 9690–9700. PMLR, 2020.

  [51] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, pages 10506–10518, 2019.

  [52] John Wishart. The generalised product moment distribution in samples from a normal multivariate population. Biometrika, 20(1/2):32–52, 1928.

  [53] Adam X Yang, Maxime Robeyns, Xi Wang, and Laurence Aitchison. Bayesian low-rank adaptation for large language models. In The Twelfth International Conference on Learning Representations, 2024.

  [54] Taeseong Yoon and Heeyoung Kim. Uncertainty estimation by density aware evidential deep learning. arXiv preprint arXiv:2409.08754, 2024.

Internal anchors (excerpts extracted from the paper's supplementary material):

  [55] Limitations. "While ETN improves the uncertainty estimation performance of pretrained models without harming accuracy and with only minimal additional computational cost, it also has several limitations. First, the benefits of ETN are largely empirical rather than theoretical. Recent works have raised concerns about EDL from a theoretical standpoint, argu..."

  [56] Proofs and Derivations. Analyzes the behavior of logits produced by models trained with cross-entropy and EDL losses. The per-sample softmax cross-entropy loss is defined as $\mathcal{L}_{\mathrm{CE}}(z, y) = -\log \frac{e^{z_y}}{\sum_{j=1}^{C} e^{z_j}} = \log\bigl(1 + \sum_{j \neq y} e^{z_j - z_y}\bigr)$, and the inter-class margin of a sample as $\gamma(z, y) = z_y - \max_{j \neq y} z_j$. "Given these def..."

  [57] Modeling Transformation Parameterizations. Describes how the transformation parameter A is modeled when defined as a scalar, vector, or matrix; specifically, (1) how the variational distribution over A is constructed, and (2) how the prior term b is handled. "For clarity, we denote the scalar case by a, the vector case by a, an..."

  [58] Experimental Setting: Training Details. The hyperparameters used for training ETN are summarized in Table 3. For LLM experiments, cosine learning-rate scheduling with warm-up steps is employed. All experiments are performed with three different random seeds, and means are reported with 95% confidence intervals. "For post-hoc uncertainty estima..."

  [59] "... for all experimental settings, and train the additional parameters using the reverse KL formulation of L_EDL."

  [60] Additional Experiments: OOD-Detection Baselines. Compares ETN against ODIN [33] and the Mahalanobis distance method (MD) [31]; although neither is strictly an uncertainty estimation method, both operate in a post-hoc manner. (Results table truncated in extraction: "Method CIFAR10→CIFAR10-OOD ImageNet→ImageNet-OOD OBQA→MMLU RACE→MMLU MD45.4...")
    Additional Experiments 12.1. OOD-Detection Baselines In this section, we compare ETN against ODIN [33] and the Mahalanobis distance method (MD) [31]. Although neither ODIN nor MD are strictly uncertainty estimation methods, we include them as they both work in post-hoc manner, and Method CIFAR10→CIFAR10-OOD ImageNet→ImageNet-OOD OBQA→MMLU RACE→MMLU MD45.4...