pith. machine review for the scientific record.

arxiv: 2604.13258 · v1 · submitted 2026-04-14 · 💻 cs.CL · cs.AI

Recognition: unknown

Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs

Maisha Maliha, Nathaniel D. Bastian, Sumit Kumar Jha, Vishal Pramanik

Pith reviewed 2026-05-10 15:19 UTC · model grok-4.3

keywords token attribution · autoregressive LLMs · model interpretability · Hessian sensitivity · KL divergence · decoder-only models · attribution faithfulness · generative benchmarks

The pith

HETA improves token attributions for autoregressive language models by combining semantic transition vectors, Hessian sensitivities, and KL divergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HETA to explain which input tokens shape the outputs of decoder-only language models. Existing attribution methods rely on linear approximations designed for encoder architectures and miss the sequential causal dynamics of autoregressive generation. HETA adds a semantic transition vector to track token influence across layers, Hessian-based scores to capture second-order sensitivity effects, and KL divergence to quantify information loss from masking tokens. Evaluations across models and datasets show higher faithfulness scores and stronger agreement with human annotations than prior techniques. This matters because clearer explanations can help users understand and verify model behavior in text generation tasks.

Core claim

HETA is a unified attribution framework for decoder-only language models that integrates a semantic transition vector capturing token-to-token influence across layers, Hessian-based sensitivity scores modeling second-order effects, and KL divergence measuring information loss when tokens are masked. This produces context-aware, causally faithful, and semantically grounded attributions. Empirical tests across multiple models and datasets show consistent outperformance over existing methods on faithfulness metrics and alignment with human annotations, while also introducing a curated benchmark dataset for generative attribution evaluation.
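
The masking-based KL component described above can be illustrated with a minimal sketch. It assumes a hypothetical stand-in model (`toy_next_token_probs`, with made-up weights) and a hypothetical `[MASK]` placeholder; it is not the paper's implementation, only an instance of "mask one token, then measure how far the next-token distribution moves".

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def toy_next_token_probs(tokens):
    """Hypothetical stand-in for a language model over a 3-word vocabulary."""
    # Each known token adds a fixed vector to the logits; unknown tokens add nothing.
    weights = {"paris": [2.0, 0.0, 0.0], "capital": [0.5, 0.5, 0.0], "the": [0.0, 0.0, 0.1]}
    logits = [0.0, 0.0, 0.0]
    for tok in tokens:
        for i, w in enumerate(weights.get(tok, [0.0, 0.0, 0.0])):
            logits[i] += w
    return softmax(logits)

def masking_kl_scores(tokens, mask_token="[MASK]"):
    """Score each token by KL(full || masked) after replacing it with mask_token."""
    full = toy_next_token_probs(tokens)
    scores = []
    for i in range(len(tokens)):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        scores.append(kl_divergence(full, toy_next_token_probs(masked)))
    return scores

scores = masking_kl_scores(["the", "capital", "paris"])
```

In this toy setup, masking "paris" shifts the distribution most and masking "the" shifts it least, so the scores rank the tokens in that order. With a real model, each masked variant costs one extra forward pass.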

What carries the argument

The HETA framework, which unifies a semantic transition vector, Hessian-based sensitivity scores, and KL divergence measurements to quantify each token's contribution during autoregressive generation.
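
The semantic transition vector is not defined on this page, so the sketch below shows a related, well-known technique instead: attention rollout, which multiplies residual-adjusted attention matrices across layers to estimate how much each input position feeds the final position. The two causal attention matrices and the `residual_weight` default are illustrative assumptions, not values from the paper.

```python
def matmul(a, b):
    """Plain-Python product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def attention_rollout(layer_attentions, residual_weight=0.5):
    """Roll out attention across layers, mixing in the identity for residual paths."""
    n = len(layer_attentions[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layer_attentions:
        # Blend attention with the identity so each row still sums to 1.
        mixed = [[(1.0 - residual_weight) * attn[i][j] + (residual_weight if i == j else 0.0)
                  for j in range(n)] for i in range(n)]
        rollout = matmul(mixed, rollout)
    return rollout

# Two hypothetical causal (lower-triangular) attention maps over three tokens.
layers = [
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]],
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.1, 0.6, 0.3]],
]
rollout = attention_rollout(layers)
target_row = rollout[-1]  # estimated influence of each input token on the last position
```

Because every mixed matrix is row-stochastic and lower-triangular, the rollout stays a valid causal weighting: no position receives influence from tokens that come after it.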

If this is right

  • Attributions produced by HETA align more closely with human judgments on generated text than prior methods.
  • The framework generalizes across multiple decoder-only models and evaluation datasets.
  • A new benchmark dataset enables systematic comparison of attribution quality in generative settings.
  • HETA addresses shortcomings of encoder-focused linear techniques for causal autoregressive processes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • HETA attributions could help trace which tokens trigger specific outputs such as factual errors or biased responses.
  • The Hessian component might extend to second-order analysis in other neural network interpretability tasks.
  • The introduced benchmark could become a reference standard for testing future attribution methods on generative models.

Load-bearing premise

That the combination of semantic transition vectors, Hessian sensitivities, and KL divergence captures the causal and semantic complexities of autoregressive generation more effectively than linear approximations.
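
For intuition only, the contrast between linear and second-order attribution can be written as a standard Taylor expansion. The symbols are illustrative and are not the paper's notation: f is a scalar model output (for example, the target-token log-probability), x an input embedding, and δ a perturbation.

```latex
f(x + \delta) \approx f(x) + \nabla f(x)^{\top} \delta
  + \tfrac{1}{2}\, \delta^{\top} H(x)\, \delta,
\qquad H(x) = \nabla^{2} f(x)
```

Gradient-only methods keep the first two terms. The quadratic term is where curvature and pairwise interactions between input dimensions enter, and it is the part a Hessian-based score is meant to capture.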

What would settle it

Direct head-to-head tests on the paper's benchmark dataset where HETA fails to exceed baseline methods on faithfulness metrics or human annotation agreement would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.13258 by Maisha Maliha, Nathaniel D. Bastian, Sumit Kumar Jha, Vishal Pramanik.

Figure 1. Overview of HETA. The pipeline (a) rolls out attention–value flows that end at the target token to form a causal gate over input tokens, (b) estimates token-level curvature via scalable Hessian–vector products to capture nonlinear interactions, and (c) measures KL-based information impact under token masking. The final attribution combines causal gating, curvature sensitivity, and information gain to produ…
Figure 2. (a)-(c) Analysis of HETA components. Each bar plot shows the effect of ablating key components of HETA. The full HETA model achieves the highest attribution faithfulness and alignment across all metrics, while removing individual components consistently degrades performance. (d) Input importance distributions for a generative task using our proposed HETA method.
Figure 3. (a)-(c) Analysis of robustness of HETA vs. baseline methods. (Left) Sensitivity under Gaussian perturbations (lower is better), where HETA maintains the lowest variance across input noise. (Center) Active/Passive robustness (higher is better), reflecting attribution consistency across syntactic rephrasings. (Right) Alignment F1 score against annotated tokens (higher is better). HETA outperforms all baselines…
Figure 4. Word-level attribution visualization for predicting the final word “slice.” Each word is …
Figure 5. Word-level attribution visualization for predicting the final word “friends.” Bounding boxes …
Figure 6. Word-level attribution visualization for predicting the final word “bush.” Attribution scores …
Original abstract

Attribution methods seek to explain language model predictions by quantifying the contribution of input tokens to generated outputs. However, most existing techniques are designed for encoder-based architectures and rely on linear approximations that fail to capture the causal and semantic complexities of autoregressive generation in decoder-only models. To address these limitations, we propose Hessian-Enhanced Token Attribution (HETA), a novel attribution framework tailored for decoder-only language models. HETA combines three complementary components: a semantic transition vector that captures token-to-token influence across layers, Hessian-based sensitivity scores that model second-order effects, and KL divergence to measure information loss when tokens are masked. This unified design produces context-aware, causally faithful, and semantically grounded attributions. Additionally, we introduce a curated benchmark dataset for systematically evaluating attribution quality in generative settings. Empirical evaluations across multiple models and datasets demonstrate that HETA consistently outperforms existing methods in attribution faithfulness and alignment with human annotations, establishing a new standard for interpretability in autoregressive language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Hessian-Enhanced Token Attribution (HETA) for decoder-only autoregressive LLMs. HETA integrates a semantic transition vector capturing token-to-token influence across layers, Hessian-based second-order sensitivity scores, and KL divergence for information loss under masking. The method is evaluated on a newly curated benchmark dataset, with claims that it yields more context-aware, causally faithful attributions than prior linear-approximation techniques and aligns better with human annotations across multiple models and datasets.

Significance. If the empirical superiority holds under rigorous controls, HETA would address a clear gap in interpretability methods for generative decoder-only models, moving beyond encoder-centric linear approximations. The introduction of a dedicated generative benchmark is a positive contribution that could facilitate future standardized comparisons.

major comments (2)
  1. The central empirical claim (consistent outperformance in faithfulness and human alignment) is load-bearing yet unsupported by any quantitative results, tables, error bars, or explicit baseline comparisons in the abstract or visible structure. The manuscript must supply these details (e.g., specific faithfulness metrics, statistical significance tests, and ablation results) in the experimental evaluation section to substantiate the claim that the three-component design outperforms existing methods.
  2. No discussion of the computational cost of Hessian computation appears, despite its known expense for large models. This omission affects the practicality claim; the paper should quantify runtime and memory overhead relative to baselines (e.g., in the section on experiments or implementation details) and discuss any approximations used.
minor comments (2)
  1. Notation for the semantic transition vector and Hessian sensitivity scores should be defined explicitly with equations early in the method section to improve readability.
  2. The abstract asserts 'new standard' status; this phrasing should be tempered to 'promising results' pending peer validation and broader replication.
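
Major comment 1 asks for specific faithfulness metrics. One common choice in the attribution literature is a deletion curve: mask tokens from most to least attributed and track how fast the model's score drops. The sketch below assumes a hypothetical additive scorer (`toy_score`) and a hypothetical `[MASK]` placeholder; it does not reproduce the paper's evaluation protocol.

```python
def deletion_curve(tokens, attributions, score_fn, mask_token="[MASK]"):
    """Mask tokens in descending attribution order, recording the score each step."""
    order = sorted(range(len(tokens)), key=lambda i: attributions[i], reverse=True)
    current = list(tokens)
    curve = [score_fn(current)]
    for idx in order:
        current[idx] = mask_token
        curve.append(score_fn(current))
    return curve

def area_under_curve(curve):
    """Trapezoidal area with unit spacing, normalised by the number of steps."""
    steps = len(curve) - 1
    return sum((curve[i] + curve[i + 1]) / 2 for i in range(steps)) / steps

def toy_score(tokens):
    """Hypothetical model score: a fixed additive contribution per known token."""
    weights = {"paris": 0.6, "capital": 0.3, "the": 0.1}
    return sum(weights.get(t, 0.0) for t in tokens)

tokens = ["the", "capital", "paris"]
# Attributions that match the toy model's true weights versus a reversed ranking.
auc_good = area_under_curve(deletion_curve(tokens, [0.1, 0.3, 0.6], toy_score))
auc_bad = area_under_curve(deletion_curve(tokens, [0.6, 0.3, 0.1], toy_score))
```

A lower deletion area means the attribution found the tokens the model actually relies on. Here the correct ranking yields roughly 1/3 and the reversed ranking roughly 2/3.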

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the two major comments point by point below and will incorporate revisions to improve the clarity and completeness of the empirical presentation and practical discussion.

Point-by-point responses
  1. Referee: The central empirical claim (consistent outperformance in faithfulness and human alignment) is load-bearing yet unsupported by any quantitative results, tables, error bars, or explicit baseline comparisons in the abstract or visible structure. The manuscript must supply these details (e.g., specific faithfulness metrics, statistical significance tests, and ablation results) in the experimental evaluation section to substantiate the claim that the three-component design outperforms existing methods.

    Authors: We appreciate the referee drawing attention to the need for explicit quantitative support. The experimental evaluation section contains the relevant results, including faithfulness metrics (insertion/deletion and human correlation scores), tables comparing HETA to baselines such as Integrated Gradients and attention rollout across models, error bars from repeated runs, and component ablations. To address visibility concerns and strengthen substantiation of the three-component design, we will revise the manuscript to add a consolidated summary table with statistical significance tests (e.g., paired t-tests) directly in the main experimental section.

    Revision: yes

  2. Referee: No discussion of the computational cost of Hessian computation appears, despite its known expense for large models. This omission affects the practicality claim; the paper should quantify runtime and memory overhead relative to baselines (e.g., in the section on experiments or implementation details) and discuss any approximations used.

    Authors: We agree that computational overhead is an important practical consideration that was not addressed. In the revised manuscript we will add a dedicated paragraph (or short subsection) in the experimental or implementation details section that reports runtime and peak memory usage of the Hessian component relative to the baselines, along with any approximations (such as diagonal or layer-wise Hessian estimates) employed to ensure scalability.

    Revision: yes
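
To make the cost question concrete, here is a minimal sketch of a Hessian-vector product computed without ever forming the Hessian. It uses a central difference of gradients on a hypothetical toy function with a hand-written gradient (`toy_grad`). A real pipeline would more likely use automatic differentiation, and nothing here is taken from the paper's implementation.

```python
def toy_grad(x):
    """Analytic gradient of the hypothetical toy function
    f(x) = x0^2 * x1 + 3 * x1^2 + x0 * x2."""
    x0, x1, x2 = x
    return [2 * x0 * x1 + x2, x0 ** 2 + 6 * x1, x0]

def hvp(grad_fn, x, v, eps=1e-4):
    """Approximate H(x) @ v with a central difference of two gradient calls."""
    plus = grad_fn([xi + eps * vi for xi, vi in zip(x, v)])
    minus = grad_fn([xi - eps * vi for xi, vi in zip(x, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]

def hessian_diagonal(grad_fn, x):
    """Diagonal of the Hessian via one HVP per coordinate (n HVPs in total)."""
    n = len(x)
    diag = []
    for i in range(n):
        e_i = [1.0 if j == i else 0.0 for j in range(n)]
        diag.append(hvp(grad_fn, x, e_i)[i])
    return diag

x = [1.0, 2.0, 3.0]
hv = hvp(toy_grad, x, [1.0, 0.0, 0.0])   # first column of the Hessian at x
diag = hessian_diagonal(toy_grad, x)     # per-coordinate curvature
```

Each product needs two gradient evaluations and memory linear in the input size, whereas the explicit Hessian is n by n. The coordinate-wise diagonal still needs n products, which is why randomized or layer-wise estimates are often considered for large models.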

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

The paper proposes HETA by combining a semantic transition vector, Hessian-based sensitivity scores, and KL divergence to produce attributions for decoder-only models. No equations appear in the abstract or description that define any output quantity in terms of itself or reduce a claimed prediction to a fitted input by construction. The central claims rest on empirical outperformance across models and datasets plus a new benchmark, which are external to the method definition itself. No self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to bear the load of the derivation. The approach therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the method extends standard attribution techniques (Hessian, KL divergence) without introducing new postulated objects or fitted constants visible at this level.

pith-pipeline@v0.9.0 · 5478 in / 1146 out tokens · 53144 ms · 2026-05-10T15:19:45.289619+00:00 · methodology

