Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution
Pith reviewed 2026-05-16 22:41 UTC · model grok-4.3
The pith
Feature attribution methods for language models carry structured lexical and position biases that trade off against each other.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Feature attribution methods exhibit structured biases that can be isolated into lexical preferences for specific tokens and positional preferences for locations in the sequence. Using three evaluation metrics, the authors demonstrate a consistent trade-off where high lexical bias correlates with low positional bias and vice versa in comparisons between transformer models. Additionally, explanations identified as anomalous show elevated levels of these biases.
What carries the argument
A model- and method-agnostic framework of three evaluation metrics that separately quantify lexical bias (preference for particular words) and position bias (preference for particular locations) in post-hoc attributions.
If this is right
- Explanations for the same input differ systematically across methods because each method-model pair favors either lexical or positional cues.
- High lexical bias in one model is accompanied by low position bias, implying users cannot assume any single method is uniformly reliable.
- Anomalous explanations are more likely to reflect strong underlying biases rather than faithful model behavior.
- The framework permits direct, comparable assessment of bias profiles across attribution techniques on both artificial and natural data.
Where Pith is reading between the lines
- Combining attributions from multiple methods could offset the observed trade-off and yield more stable explanations in practice.
- Position bias may be especially consequential in tasks that depend on long-range order, such as causal inference over extended text.
- The metrics could be extended to other explanation families beyond feature attribution to test whether the same lexical-position trade-off appears.
- If the biases prove persistent, training objectives that penalize both lexical and positional skew could be explored as a mitigation.
Load-bearing premise
The three proposed evaluation metrics accurately isolate lexical and position biases without introducing measurement artifacts of their own.
What would settle it
Running the three metrics on a broader set of models and tasks and finding either no trade-off between lexical and position bias scores or no elevated bias in anomalous explanations would falsify the central claim.
Figures
read the original abstract
Good quality explanations strengthen the understanding of language models and data. Feature attribution methods, such as Integrated Gradient, are a type of post-hoc explainer that can provide token-level insights. However, explanations on the same input may vary greatly due to underlying biases of different methods. Users may be aware of this issue and mistrust their utility, while unaware users may trust them inadequately. In this work, we delve beyond the superficial inconsistencies between attribution methods, structuring their biases through a model- and method-agnostic framework of three evaluation metrics. We systematically assess both lexical and position bias (what and where in the input) for two transformers; first, in a controlled, pseudo-random classification task on artificial data; then, in a semi-controlled causal relation detection task on natural data. We find a trade-off between lexical and position biases in our model comparison, with models that score high on one type score low on the other. We also find signs that anomalous explanations are more likely to be biased.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a model- and method-agnostic framework of three evaluation metrics to quantify lexical bias (preference for specific tokens) and position bias (preference for locations in the input sequence) in post-hoc feature attribution methods such as Integrated Gradients. It applies these metrics to two transformer models across a controlled pseudo-random classification task on artificial data and a semi-controlled causal relation detection task on natural data, reporting a trade-off in which models scoring high on one bias type score low on the other, along with indications that anomalous explanations exhibit greater bias.
Significance. If the metrics can be shown to isolate lexical and positional preferences without residual confounding from data-generation artifacts, the work supplies a practical diagnostic tool for characterizing systematic biases in explanation methods. The reported trade-off and anomaly correlation, if robust, would offer actionable guidance for selecting or refining attribution techniques in NLP, addressing user mistrust in explanations. The controlled artificial task provides a useful testbed, though generalization to typical real-world use cases remains to be established.
major comments (3)
- [§4.1] §4.1 (artificial data generation): the pseudo-random classification procedure assigns labels via controlled rules, yet the manuscript does not demonstrate that lexical items are statistically independent of position; any residual correlation would artifactually induce the reported negative correlation between lexical and position bias scores, directly undermining the central trade-off claim.
- [§3] §3 (metric definitions): the three proposed metrics are described at a high level but lack explicit equations or pseudocode; without these, it is impossible to verify that they separate lexical preference from positional preference rather than capturing correlated artifacts of the attribution method or tokenization.
- [Results] Results (anomaly analysis): the claim that anomalous explanations are more likely to be biased is stated without a precise definition of 'anomaly,' without quantitative thresholds, and without reported statistical tests or effect sizes, leaving the association unsupported by the evidence presented.
minor comments (2)
- [Abstract] Abstract: the phrase 'signs that anomalous explanations are more likely to be biased' is vague; a brief quantitative qualifier or reference to the relevant figure would improve clarity.
- [§3] The manuscript would benefit from a table summarizing the three metrics, their formulas, and the exact bias scores obtained for each model and task.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript's rigor and clarity.
read point-by-point responses
-
Referee: [§4.1] §4.1 (artificial data generation): the pseudo-random classification procedure assigns labels via controlled rules, yet the manuscript does not demonstrate that lexical items are statistically independent of position; any residual correlation would artifactually induce the reported negative correlation between lexical and position bias scores, directly undermining the central trade-off claim.
Authors: We appreciate this observation on potential confounding. The pseudo-random procedure in §4.1 was constructed with controlled rules specifically to decouple lexical items from positions. To rigorously substantiate this and support the trade-off claim, we will add an explicit statistical verification of independence (e.g., correlation coefficients or mutual information scores) between lexical features and positions in the revised manuscript. revision: yes
-
Referee: [§3] §3 (metric definitions): the three proposed metrics are described at a high level but lack explicit equations or pseudocode; without these, it is impossible to verify that they separate lexical preference from positional preference rather than capturing correlated artifacts of the attribution method or tokenization.
Authors: We agree that greater formality is needed for verifiability. In the revised version, we will supply explicit mathematical equations for each of the three metrics together with pseudocode for their computation, making transparent how lexical and positional preferences are isolated. revision: yes
-
Referee: [Results] Results (anomaly analysis): the claim that anomalous explanations are more likely to be biased is stated without a precise definition of 'anomaly,' without quantitative thresholds, and without reported statistical tests or effect sizes, leaving the association unsupported by the evidence presented.
Authors: We acknowledge the need for precision here. We will revise the anomaly analysis to include a clear operational definition of anomalous explanations, specify the quantitative thresholds employed, and report statistical tests with effect sizes to support the association with higher bias. revision: yes
Circularity Check
No circularity: empirical metrics applied to controlled tasks
full rationale
The paper defines three evaluation metrics for lexical and positional bias in feature attributions, then reports observed scores and trade-offs from experiments on a pseudo-random artificial classification task and a semi-controlled natural causal-relation task. No equations, fitted parameters, derivations, or self-citations appear in the provided text; the central trade-off claim is an empirical observation across models rather than a quantity that reduces to its own inputs by construction. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
three evaluation metrics... Bias-cons... Bias-agg... Bias-attr... Jensen-Shannon distance... trade-off between lexical and position biases
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
position bias... lexical bias... top-1 token explanations... frequency distributions
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
InICLR 2024 Workshop on Secure and Trustworthy Large Language Models
The effect of model size on LLM post-hoc explainability via LIME. InICLR 2024 Workshop on Secure and Trustworthy Large Language Models. Alon Jacovi, Hendrik Schuff, Heike Adel, Ngoc Thang Vu, and Yoav Goldberg. 2023. Neighboring words affect human interpretation of saliency explanations. InFindings of the Association for Computational Linguistics: ACL 202...
work page 2024
-
[2]
Dynamic top-k estimation consolidates dis- agreement between feature attribution methods. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6190–6197, Singapore. Association for Computa- tional Linguistics (ACL). Jonathan Kamp, Lisa Beinborn, and Antske Fokkens
work page 2023
-
[3]
The role of syntactic span preferences in post- hoc explanation disagreement. InProceedings of the 2024 Joint International Conference on Computa- tional Linguistics, Language Resources and Evalu- ation (LREC-COLING 2024), pages 16066–16078, Torino, Italia. ELRA and ICCL. Miyoung Ko, Jinhyuk Lee, Hyunjae Kim, Gangwoo Kim, and Jaewoo Kang. 2020. Look at th...
-
[4]
“why should I trust you?”: Explaining the pre- dictions of any classifier. InProceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demon- strations, pages 97–101, San Diego, California. As- sociation for Computational Linguistics. Avanti Shrikumar, Peyton Greenside, and Anshul Kun- daje. 2017. L...
work page 2016
-
[5]
PMLR. Guido A Veldhuis, Dominique Blok, Maaike HT de Boer, Gino J Kalkman, Roos M Bakker, and Rob PM van Waas. 2024. From text to model: Lever- aging natural language processing for system dynam- ics model development.System Dynamics Review, 40(3):e1780. Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Al...
work page internal anchor Pith review arXiv 2024
-
[6]
Rethinking cooperative rationalization: In- trospective extraction and complement control. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4094–4103. 10 Wei Zhou, Heike Adel, Hendrik Schuff, and Ngoc Thang Vu. 2024. Exp...
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.