
arxiv: 2606.19509 · v1 · pith:JYVHZKHBnew · submitted 2026-06-17 · 💻 cs.AI

LLM Doesn't Know What It Doesn't Know: Detecting Epistemic Blind Spots via Cross-Model Attribution Divergence on Clinical Tabular Data

Pith reviewed 2026-06-26 20:53 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM calibration · epistemic uncertainty · clinical tabular data · attribution divergence · cross-model comparison · expected calibration error · SHAP values · XGBoost

The pith

A cross-model calibrator using attribution divergence between an LLM and XGBoost reduces expected calibration error on clinical tabular data from 0.254 to 0.080.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LLMs cannot detect their own knowledge limits on structured clinical tasks because verbalized confidence remains nearly constant across wide ranges of actual accuracy. It shows that divergence in feature attributions between the LLM and a gradient-boosted tree model tracks when the LLM is likely to fail, especially in an inverse pattern where the LLM performs worst precisely when the tree model is most certain. A simple calibrator built on this divergence signal supplies patient-level reliability estimates that replace the uninformative verbal scores and sharply improve calibration. The work frames the problem as a cold-start issue for LLMs on tabular data and demonstrates that external model comparison can produce usable epistemic awareness without repeated sampling or access to internal states.

Core claim

The central claim is that cross-model attribution divergence serves as a usable proxy for an LLM's epistemic uncertainty on clinical tabular prediction; a calibrator trained on this signal reduces expected calibration error from 0.254 to 0.080, yields patient-specific reliability estimates, and does so without model internals or repeated inference.
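The headline numbers (0.254 and 0.080) are expected calibration error (ECE) values. As a reminder of what that metric measures, here is a minimal sketch of the common equal-width binned ECE. The bin count and the toy data are assumptions for illustration; the source does not say which ECE variant or binning the paper uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: sum over bins of bin_weight * |accuracy - mean confidence|.

    n_bins=10 is an assumed default, not a value taken from the paper.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp the index so that conf == 1.0 lands in the last bin.
        idx = max(0, min(int(conf * n_bins), n_bins - 1))
        bins[idx].append((conf, bool(hit)))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy data (made up): a constant stated confidence of 0.9 with 50% accuracy.
toy_confidences = [0.9] * 10
toy_correct = [True, False] * 5
print(expected_calibration_error(toy_confidences, toy_correct))
```

A near-constant confidence paired with much lower accuracy, as in the toy data, puts all cases in one bin and yields a large gap, which is the failure pattern the paper attributes to verbalized confidence.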

What carries the argument

The Attribution Disagreement Score (ADS) derived from comparing LLM and XGBoost feature attributions, which feeds a cross-model calibrator that outputs reliability estimates.
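The abstract does not give the ADS formula, so the following is not the paper's metric. It is a hypothetical sketch of two generic ways to quantify disagreement between two per-patient attribution vectors: top-k feature-set overlap and the L1 distance between normalized attribution shares. The function names, the choice of k, and the feature names are all assumptions made for illustration.

```python
def topk_jaccard_distance(attr_a, attr_b, k=3):
    """1 - |A & B| / |A | B|, where A and B are each model's k largest-|attribution| features."""
    if k < 1 or not attr_a or not attr_b:
        raise ValueError("need k >= 1 and non-empty attribution dicts")

    def top(attr):
        ranked = sorted(attr, key=lambda f: abs(attr[f]), reverse=True)
        return set(ranked[:k])

    a, b = top(attr_a), top(attr_b)
    return 1.0 - len(a & b) / len(a | b)


def l1_share_distance(attr_a, attr_b):
    """L1 distance between normalized |attribution| shares: 0 (identical) up to 2 (disjoint)."""
    def shares(attr):
        total = sum(abs(v) for v in attr.values())
        if total == 0:
            raise ValueError("attributions must not all be zero")
        return {f: abs(v) / total for f, v in attr.items()}

    sa, sb = shares(attr_a), shares(attr_b)
    return sum(abs(sa.get(f, 0.0) - sb.get(f, 0.0)) for f in set(sa) | set(sb))


# Hypothetical per-patient attributions; feature names and values are made up.
tree_attr = {"creatinine": 0.6, "age": 0.3, "lactate": 0.1, "heart_rate": 0.0}
llm_attr = {"creatinine": 0.1, "age": 0.2, "lactate": 0.1, "heart_rate": 0.6}

print(topk_jaccard_distance(tree_attr, llm_attr, k=2))
print(l1_share_distance(tree_attr, llm_attr))
```

Either quantity could serve as an input feature to a reliability model; whether the paper's ADS resembles them is not something the abstract settles.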

If this is right

  • LLM verbalized confidence tracks prompt format rather than prediction quality and stays in a narrow high band even when accuracy falls to 49 percent.
  • An inverse difficulty effect appears: LLM accuracy drops when the XGBoost model is near-certain, yet matches the tree model when the latter is only moderately confident.
  • Few-shot examples and SHAP-derived feature evidence act as orthogonal, super-additive interventions that jointly cut the Attribution Disagreement Score and raise accuracy.
  • The calibrator supplies patient-specific reliability without requiring repeated model calls or internal logit access.
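The inverse difficulty effect in the second bullet is a statement about LLM accuracy stratified by how confident the tree model is. A minimal sketch of that stratification follows, assuming "near-certain" refers to the reference model's predicted probability for a binary outcome. The band edges and the toy data are assumptions, not values from the paper.

```python
import bisect


def accuracy_by_reference_confidence(ref_probs, llm_correct, edges=(0.6, 0.8, 0.95)):
    """Group cases by reference-model confidence max(p, 1 - p); return LLM accuracy per band.

    Band 0 is confidence below edges[0]; the highest band is confidence >= edges[-1].
    The edges are illustrative defaults.
    """
    if len(ref_probs) != len(llm_correct):
        raise ValueError("inputs must have the same length")

    counts = {}
    for p, hit in zip(ref_probs, llm_correct):
        confidence = max(p, 1.0 - p)
        band = bisect.bisect_right(edges, confidence)
        hits, total = counts.get(band, (0, 0))
        counts[band] = (hits + int(bool(hit)), total + 1)

    return {band: hits / total for band, (hits, total) in sorted(counts.items())}


# Toy data (made up): two near-certain cases and two moderately confident ones.
ref_probs = [0.99, 0.01, 0.70, 0.30]
llm_correct = [False, True, True, True]
print(accuracy_by_reference_confidence(ref_probs, llm_correct))
```

An inverse difficulty pattern would show up as lower LLM accuracy in the highest band than in the middle bands.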

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be tested on other tabular domains such as finance or sensor data to check whether attribution divergence remains informative outside clinical settings.
  • Replacing the XGBoost reference model with a different non-LLM baseline might reveal whether the signal is tied to tree-based structure or works more generally.
  • The approach suggests that production clinical systems could maintain a lightweight tree model in parallel with the LLM solely to monitor when the LLM's outputs are likely to be unreliable.
  • If the divergence signal proves stable across prompt variations, it could serve as a lightweight audit layer for any LLM deployed on structured inputs.
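To make the monitoring idea concrete, here is a hypothetical sketch of the simplest possible reliability layer: histogram binning that maps a disagreement score to the LLM accuracy observed at similar scores on held-out data. The source does not describe the paper's calibrator, so the class name, the bin edges, and the toy data are assumptions, and this should not be read as the authors' method.

```python
import bisect


class BinnedReliabilityCalibrator:
    """Maps a disagreement score to the empirical accuracy of its score bin."""

    def __init__(self, edges):
        self.edges = sorted(edges)
        self.reliability = None

    def fit(self, scores, correct):
        if len(scores) != len(correct) or not scores:
            raise ValueError("need equal-length, non-empty inputs")
        n_bins = len(self.edges) + 1
        hits = [0] * n_bins
        totals = [0] * n_bins
        for score, hit in zip(scores, correct):
            b = bisect.bisect_right(self.edges, score)
            hits[b] += int(bool(hit))
            totals[b] += 1
        overall = sum(hits) / sum(totals)
        # An empty bin falls back to the overall accuracy.
        self.reliability = [
            hits[b] / totals[b] if totals[b] else overall for b in range(n_bins)
        ]
        return self

    def predict(self, score):
        if self.reliability is None:
            raise RuntimeError("call fit() before predict()")
        return self.reliability[bisect.bisect_right(self.edges, score)]


# Toy held-out data (made up): low disagreement coincides with correct LLM answers.
scores = [0.2, 0.3, 0.7, 0.8, 1.4, 1.6]
correct = [True, True, True, False, False, False]
calibrator = BinnedReliabilityCalibrator(edges=[0.5, 1.0]).fit(scores, correct)
print(calibrator.predict(0.1), calibrator.predict(0.9), calibrator.predict(2.0))
```

In a deployment like the one sketched in the bullets, the predicted reliability would be reported in place of the LLM's verbalized confidence, and the bins would need refitting whenever the prompt or either model changes.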

Load-bearing premise

That divergence in attributions between the LLM and XGBoost measures the LLM's epistemic uncertainty rather than merely reflecting differences in model architecture or training data.

What would settle it

On a fresh clinical tabular dataset, if the divergence-based calibrator fails to produce lower expected calibration error than raw verbalized confidence while the LLM's accuracy remains comparable, the proxy claim would be falsified.

Original abstract

Large language models (LLMs) are increasingly applied to structured clinical data, yet whether they can recognize the limits of their own knowledge on such tasks remains unexplored. We study this question through the lens of cross-model attribution divergence with the goal of reducing epistemic uncertainty for structured tasks, comparing Qwen 2.5 7B and XGBoost on a prediction task via attribution divergence analysis. We report four findings. First, LLM verbalized confidence is epistemically vacuous: it outputs a near-constant value (0.856-0.937) regardless of whether accuracy is 49% or 75.3%, tracking prompt format rather than prediction quality. Second, the LLM exhibits an inverse difficulty effect: accuracy drops to 64.8% when XGBoost is 99% correct, but matches XGBoost (73.8% vs. 73.1%) when it is moderately uncertain. Third, few-shot examples and SHAP-derived feature evidence are orthogonal, super-additive interventions: they reduce the Attribution Disagreement Score (ADS) from 1.54 to 0.38 and improve accuracy from 49% to 75.3% without training. Fourth, a cross-model calibrator that determines LLM reliability using attribution divergence signals reduces expected calibration error from 0.254 to 0.080, replacing uninformative verbalized confidence with patient-specific reliability estimates, without accessing model internals or requiring repeated inference. We frame these findings as a cold start problem for LLMs on structured data and outline a path toward genuine epistemic self-awareness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines LLMs on clinical tabular prediction tasks and reports that verbalized confidence is uninformative (near-constant 0.856-0.937 across accuracy levels of 49-75.3%), that an inverse difficulty effect exists (LLM accuracy 64.8% when XGBoost is 99% correct vs. matching when XGBoost is moderately uncertain), that few-shot examples and SHAP evidence are super-additive in reducing Attribution Disagreement Score (ADS) from 1.54 to 0.38 and boosting accuracy to 75.3%, and that a cross-model calibrator using ADS between Qwen 2.5 7B attributions and XGBoost SHAP values reduces expected calibration error from 0.254 to 0.080 without model internals or repeated inference.

Significance. If the central calibration result holds after addressing controls, the work would offer a practical, training-free method for patient-specific reliability estimates on structured clinical data, addressing a documented gap in LLM epistemic awareness for tabular tasks and providing falsifiable empirical patterns (inverse difficulty, super-additivity) that could guide future self-calibration research.

major comments (2)
  1. [Abstract] Abstract (fourth finding) and the cross-model calibrator claim: the reduction in ECE from 0.254 to 0.080 is presented as evidence that ADS proxies LLM epistemic uncertainty, but the manuscript provides no same-architecture control (e.g., LLM-vs-LLM attribution divergence) or ablation isolating epistemic signal from fixed differences in inductive bias, optimization, and feature handling between Qwen 2.5 7B and XGBoost; without this, the divergence and the reported calibration benefit may reflect model-type mismatch rather than epistemic blind spots.
  2. [Abstract] Abstract (second finding on inverse difficulty effect): the reported accuracy drop to 64.8% when XGBoost is 99% correct is load-bearing for the epistemic interpretation, yet the abstract supplies no statistical test, confidence intervals, or dataset size that would allow assessment of whether this pattern is robust or an artifact of the specific model pair.
minor comments (2)
  1. [Abstract] Abstract: quantitative results (accuracy, ECE, ADS values) are reported without any dataset description, patient cohort size, feature count, or attribution method implementation details, which hinders reproducibility assessment even if the full text supplies them.
  2. [Abstract] Abstract: the invented term 'Attribution Disagreement Score (ADS)' is introduced without an explicit formula or normalization in the summary paragraph, requiring the reader to infer its construction from later text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We respond point-by-point to the major concerns below, indicating where revisions will be made.

  1. Referee: [Abstract] Abstract (fourth finding) and the cross-model calibrator claim: the reduction in ECE from 0.254 to 0.080 is presented as evidence that ADS proxies LLM epistemic uncertainty, but the manuscript provides no same-architecture control (e.g., LLM-vs-LLM attribution divergence) or ablation isolating epistemic signal from fixed differences in inductive bias, optimization, and feature handling between Qwen 2.5 7B and XGBoost; without this, the divergence and the reported calibration benefit may reflect model-type mismatch rather than epistemic blind spots.

    Authors: We agree that a same-architecture control would help isolate whether the divergence signal is specifically epistemic rather than arising from differences in model class. Our experimental design intentionally pairs the LLM with XGBoost because the latter is a strong, widely used baseline for tabular clinical data; the practical goal is to detect LLM epistemic blind spots relative to such a reference model. Nevertheless, the referee's point is valid. In revision we will add an explicit discussion of this limitation and include, where data permit, a supplementary LLM-to-LLM attribution divergence ablation to quantify the contribution of architectural mismatch.

    Revision: partial

  2. Referee: [Abstract] Abstract (second finding on inverse difficulty effect): the reported accuracy drop to 64.8% when XGBoost is 99% correct is load-bearing for the epistemic interpretation, yet the abstract supplies no statistical test, confidence intervals, or dataset size that would allow assessment of whether this pattern is robust or an artifact of the specific model pair.

    Authors: The full manuscript contains the dataset sizes and reports the accuracy figures with supporting statistics. We will revise the abstract to include the relevant sample size, confidence intervals, and a brief statement on the statistical assessment of the inverse difficulty effect so that readers can evaluate robustness directly from the abstract.

    Revision: yes
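For readers who want to see what the requested confidence intervals would look like, here is a minimal sketch of a Wilson score interval for an accuracy figure. The stratum size used in the example is hypothetical, since the abstract reports no sample sizes; only the 64.8% figure comes from the source.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95% coverage)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical stratum: 324 correct out of 500 cases (64.8%); n=500 is an assumption.
low, high = wilson_interval(324, 500)
print(f"accuracy 64.8%, 95% interval [{low:.3f}, {high:.3f}]")
```

How wide the interval is depends entirely on the real stratum size, which is why the referee's request for sample sizes matters for judging the inverse difficulty effect.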

Circularity Check

0 steps flagged

No circularity; empirical comparisons are self-contained


The paper's claims rest on direct empirical measurements: verbalized confidence ranges, accuracy under varying XGBoost certainty, ADS reductions from interventions, and ECE drop from 0.254 to 0.080 via a cross-model signal. No equations, parameters, or results are defined in terms of themselves; attribution divergence is computed from independent model outputs rather than fitted to the target reliability metric. No self-citations or uniqueness theorems appear in the provided text. The derivation chain consists of observable comparisons and interventions without reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Abstract-only review limits visibility into exact parameters; the approach rests on the assumption that attributions are comparable across model types and introduces new metrics without external validation.

axioms (1)
  • domain assumption: Feature attributions computed for LLM and XGBoost are directly comparable to quantify disagreement
    Required for the attribution divergence analysis and ADS metric.
invented entities (2)
  • Attribution Disagreement Score (ADS) (no independent evidence)
    purpose: Quantify divergence between LLM and XGBoost feature attributions
    New metric introduced to measure disagreement
  • Cross-model calibrator (no independent evidence)
    purpose: Estimate LLM reliability from attribution divergence signals
    Proposed method to replace verbalized confidence

pith-pipeline@v0.9.1-grok · 5831 in / 1296 out tokens · 49129 ms · 2026-06-26T20:53:19.043788+00:00 · methodology

