Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution

Dominique Blok; Jonathan Kamp; Roos Bakker

arxiv: 2512.11108 · v3 · submitted 2025-12-11 · 💻 cs.CL · cs.AI

Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution

Jonathan Kamp , Roos Bakker , Dominique Blok This is my paper

Pith reviewed 2026-05-16 22:41 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords explanation biasfeature attributionlexical biasposition biaspost-hoc explanationslanguage modelstransformersevaluation metrics

0 comments

The pith

Feature attribution methods for language models carry structured lexical and position biases that trade off against each other.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that post-hoc explanations are not neutral but contain measurable preferences for certain words over others and for tokens in certain positions in the input. It builds a set of three metrics that separate these two bias types without depending on any particular model or attribution technique. Experiments on an artificial classification task and a natural causal-relation task reveal that models scoring high on lexical bias tend to score low on position bias, and that explanations flagged as anomalous are more likely to display strong bias. A sympathetic reader would care because many downstream uses of explanations assume they reflect the model's actual reasoning; systematic skews undermine that assumption.

Core claim

Feature attribution methods exhibit structured biases that can be isolated into lexical preferences for specific tokens and positional preferences for locations in the sequence. Using three evaluation metrics, the authors demonstrate a consistent trade-off where high lexical bias correlates with low positional bias and vice versa in comparisons between transformer models. Additionally, explanations identified as anomalous show elevated levels of these biases.

What carries the argument

A model- and method-agnostic framework of three evaluation metrics that separately quantify lexical bias (preference for particular words) and position bias (preference for particular locations) in post-hoc attributions.

If this is right

Explanations for the same input differ systematically across methods because each method-model pair favors either lexical or positional cues.
High lexical bias in one model is accompanied by low position bias, implying users cannot assume any single method is uniformly reliable.
Anomalous explanations are more likely to reflect strong underlying biases rather than faithful model behavior.
The framework permits direct, comparable assessment of bias profiles across attribution techniques on both artificial and natural data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Combining attributions from multiple methods could offset the observed trade-off and yield more stable explanations in practice.
Position bias may be especially consequential in tasks that depend on long-range order, such as causal inference over extended text.
The metrics could be extended to other explanation families beyond feature attribution to test whether the same lexical-position trade-off appears.
If the biases prove persistent, training objectives that penalize both lexical and positional skew could be explored as a mitigation.

Load-bearing premise

The three proposed evaluation metrics accurately isolate lexical and position biases without introducing measurement artifacts of their own.

What would settle it

Running the three metrics on a broader set of models and tasks and finding either no trade-off between lexical and position bias scores or no elevated bias in anomalous explanations would falsify the central claim.

Figures

Figures reproduced from arXiv: 2512.11108 by Dominique Blok, Jonathan Kamp, Roos Bakker.

**Figure 2.** Figure 2: Bias-cons: inter-seed consistency on 10 differ [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Bias-cons: inter-seed consistency on 10 differ [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Sentence position Bias-agg (relative to input triple) in causal relation detection. Top row: positive class [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: noun-det-period biases. Different attribution methods show different top-1 preferences for lexical elements and their position in the sentence. Left: position frequency distribution. Right: lexical frequency distribution. Vertical red line indicates chance. 13 [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: period-comma biases. Different attribution methods show different top-1 preferences for lexical elements and their position in the sentence. Left: position frequency distribution. Right: lexical frequency distribution. Vertical red line indicates chance. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: unique-punctuation biases. Different attribution methods show different top-1 preferences for lexical elements and their position in the sentence. Left: position frequency distribution. Right: lexical frequency distribution. Vertical red line indicates chance. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

read the original abstract

Good quality explanations strengthen the understanding of language models and data. Feature attribution methods, such as Integrated Gradient, are a type of post-hoc explainer that can provide token-level insights. However, explanations on the same input may vary greatly due to underlying biases of different methods. Users may be aware of this issue and mistrust their utility, while unaware users may trust them inadequately. In this work, we delve beyond the superficial inconsistencies between attribution methods, structuring their biases through a model- and method-agnostic framework of three evaluation metrics. We systematically assess both lexical and position bias (what and where in the input) for two transformers; first, in a controlled, pseudo-random classification task on artificial data; then, in a semi-controlled causal relation detection task on natural data. We find a trade-off between lexical and position biases in our model comparison, with models that score high on one type score low on the other. We also find signs that anomalous explanations are more likely to be biased.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's three-metric framework for lexical and position bias in explanations shows a trade-off but risks confounding in the artificial data.

read the letter

Here's the quick take: this paper gives a model-agnostic framework with three metrics to quantify lexical bias (what tokens get high attribution) and position bias (where in the sequence) in post-hoc explanations. They apply it to two transformers and report a trade-off where high lexical bias comes with low position bias and vice versa, plus some signs that anomalous explanations carry more bias. The new part is the structured three-metric setup that tries to separate those two types of bias instead of just noting inconsistencies between methods like Integrated Gradients. Testing on a controlled artificial classification task and then a semi-controlled natural causal detection task is a solid way to build from simple to more realistic cases. The main concern is whether the metrics cleanly separate the biases. The stress test note points out that in the artificial task, the data generation might link specific lexical items to positions, which could create the observed trade-off as an artifact rather than a real model difference. If that's the case, the natural data results become more important, but the paper should demonstrate that the metrics avoid such confounding. The anomaly bias link also needs clearer statistical backing to hold up. This work targets researchers in explainable AI for NLP who deal with attribution methods and want tools to assess explanation reliability. Someone building or auditing explanation systems could pick up the metrics and test them on their own models. I would send it out for peer review. The core idea is worth referee time to check the metric definitions and results, even with the potential issues in the artificial setup.

Referee Report

3 major / 2 minor

Summary. The paper introduces a model- and method-agnostic framework of three evaluation metrics to quantify lexical bias (preference for specific tokens) and position bias (preference for locations in the input sequence) in post-hoc feature attribution methods such as Integrated Gradients. It applies these metrics to two transformer models across a controlled pseudo-random classification task on artificial data and a semi-controlled causal relation detection task on natural data, reporting a trade-off in which models scoring high on one bias type score low on the other, along with indications that anomalous explanations exhibit greater bias.

Significance. If the metrics can be shown to isolate lexical and positional preferences without residual confounding from data-generation artifacts, the work supplies a practical diagnostic tool for characterizing systematic biases in explanation methods. The reported trade-off and anomaly correlation, if robust, would offer actionable guidance for selecting or refining attribution techniques in NLP, addressing user mistrust in explanations. The controlled artificial task provides a useful testbed, though generalization to typical real-world use cases remains to be established.

major comments (3)

[§4.1] §4.1 (artificial data generation): the pseudo-random classification procedure assigns labels via controlled rules, yet the manuscript does not demonstrate that lexical items are statistically independent of position; any residual correlation would artifactually induce the reported negative correlation between lexical and position bias scores, directly undermining the central trade-off claim.
[§3] §3 (metric definitions): the three proposed metrics are described at a high level but lack explicit equations or pseudocode; without these, it is impossible to verify that they separate lexical preference from positional preference rather than capturing correlated artifacts of the attribution method or tokenization.
[Results] Results (anomaly analysis): the claim that anomalous explanations are more likely to be biased is stated without a precise definition of 'anomaly,' without quantitative thresholds, and without reported statistical tests or effect sizes, leaving the association unsupported by the evidence presented.

minor comments (2)

[Abstract] Abstract: the phrase 'signs that anomalous explanations are more likely to be biased' is vague; a brief quantitative qualifier or reference to the relevant figure would improve clarity.
[§3] The manuscript would benefit from a table summarizing the three metrics, their formulas, and the exact bias scores obtained for each model and task.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript's rigor and clarity.

read point-by-point responses

Referee: [§4.1] §4.1 (artificial data generation): the pseudo-random classification procedure assigns labels via controlled rules, yet the manuscript does not demonstrate that lexical items are statistically independent of position; any residual correlation would artifactually induce the reported negative correlation between lexical and position bias scores, directly undermining the central trade-off claim.

Authors: We appreciate this observation on potential confounding. The pseudo-random procedure in §4.1 was constructed with controlled rules specifically to decouple lexical items from positions. To rigorously substantiate this and support the trade-off claim, we will add an explicit statistical verification of independence (e.g., correlation coefficients or mutual information scores) between lexical features and positions in the revised manuscript. revision: yes
Referee: [§3] §3 (metric definitions): the three proposed metrics are described at a high level but lack explicit equations or pseudocode; without these, it is impossible to verify that they separate lexical preference from positional preference rather than capturing correlated artifacts of the attribution method or tokenization.

Authors: We agree that greater formality is needed for verifiability. In the revised version, we will supply explicit mathematical equations for each of the three metrics together with pseudocode for their computation, making transparent how lexical and positional preferences are isolated. revision: yes
Referee: [Results] Results (anomaly analysis): the claim that anomalous explanations are more likely to be biased is stated without a precise definition of 'anomaly,' without quantitative thresholds, and without reported statistical tests or effect sizes, leaving the association unsupported by the evidence presented.

Authors: We acknowledge the need for precision here. We will revise the anomaly analysis to include a clear operational definition of anomalous explanations, specify the quantitative thresholds employed, and report statistical tests with effect sizes to support the association with higher bias. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical metrics applied to controlled tasks

full rationale

The paper defines three evaluation metrics for lexical and positional bias in feature attributions, then reports observed scores and trade-offs from experiments on a pseudo-random artificial classification task and a semi-controlled natural causal-relation task. No equations, fitted parameters, derivations, or self-citations appear in the provided text; the central trade-off claim is an empirical observation across models rather than a quantity that reduces to its own inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the framework is described only as model- and method-agnostic without detailing background assumptions.

pith-pipeline@v0.9.0 · 5475 in / 1094 out tokens · 57646 ms · 2026-05-16T22:41:17.770409+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

three evaluation metrics... Bias-cons... Bias-agg... Bias-attr... Jensen-Shannon distance... trade-off between lexical and position biases
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

position bias... lexical bias... top-1 token explanations... frequency distributions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · 1 internal anchor

[1]

InICLR 2024 Workshop on Secure and Trustworthy Large Language Models

The effect of model size on LLM post-hoc explainability via LIME. InICLR 2024 Workshop on Secure and Trustworthy Large Language Models. Alon Jacovi, Hendrik Schuff, Heike Adel, Ngoc Thang Vu, and Yoav Goldberg. 2023. Neighboring words affect human interpretation of saliency explanations. InFindings of the Association for Computational Linguistics: ACL 202...

work page 2024
[2]

In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6190–6197, Singapore

Dynamic top-k estimation consolidates dis- agreement between feature attribution methods. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6190–6197, Singapore. Association for Computa- tional Linguistics (ACL). Jonathan Kamp, Lisa Beinborn, and Antske Fokkens

work page 2023
[3]

InProceedings of the 2024 Joint International Conference on Computa- tional Linguistics, Language Resources and Evalu- ation (LREC-COLING 2024), pages 16066–16078, Torino, Italia

The role of syntactic span preferences in post- hoc explanation disagreement. InProceedings of the 2024 Joint International Conference on Computa- tional Linguistics, Language Resources and Evalu- ation (LREC-COLING 2024), pages 16066–16078, Torino, Italia. ELRA and ICCL. Miyoung Ko, Jinhyuk Lee, Hyunjae Kim, Gangwoo Kim, and Jaewoo Kang. 2020. Look at th...

work page arXiv 2024
[4]

why should I trust you?

“why should I trust you?”: Explaining the pre- dictions of any classifier. InProceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demon- strations, pages 97–101, San Diego, California. As- sociation for Computational Linguistics. Avanti Shrikumar, Peyton Greenside, and Anshul Kun- daje. 2017. L...

work page 2016
[5]

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

PMLR. Guido A Veldhuis, Dominique Blok, Maaike HT de Boer, Gino J Kalkman, Roos M Bakker, and Rob PM van Waas. 2024. From text to model: Lever- aging natural language processing for system dynam- ics model development.System Dynamics Review, 40(3):e1780. Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Al...

work page internal anchor Pith review arXiv 2024
[6]

Rethinking cooperative rationalization: In- trospective extraction and complement control. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4094–4103. 10 Wei Zhou, Heike Adel, Hendrik Schuff, and Ngoc Thang Vu. 2024. Exp...

work page 2019

[1] [1]

InICLR 2024 Workshop on Secure and Trustworthy Large Language Models

The effect of model size on LLM post-hoc explainability via LIME. InICLR 2024 Workshop on Secure and Trustworthy Large Language Models. Alon Jacovi, Hendrik Schuff, Heike Adel, Ngoc Thang Vu, and Yoav Goldberg. 2023. Neighboring words affect human interpretation of saliency explanations. InFindings of the Association for Computational Linguistics: ACL 202...

work page 2024

[2] [2]

In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6190–6197, Singapore

Dynamic top-k estimation consolidates dis- agreement between feature attribution methods. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6190–6197, Singapore. Association for Computa- tional Linguistics (ACL). Jonathan Kamp, Lisa Beinborn, and Antske Fokkens

work page 2023

[3] [3]

InProceedings of the 2024 Joint International Conference on Computa- tional Linguistics, Language Resources and Evalu- ation (LREC-COLING 2024), pages 16066–16078, Torino, Italia

The role of syntactic span preferences in post- hoc explanation disagreement. InProceedings of the 2024 Joint International Conference on Computa- tional Linguistics, Language Resources and Evalu- ation (LREC-COLING 2024), pages 16066–16078, Torino, Italia. ELRA and ICCL. Miyoung Ko, Jinhyuk Lee, Hyunjae Kim, Gangwoo Kim, and Jaewoo Kang. 2020. Look at th...

work page arXiv 2024

[4] [4]

why should I trust you?

“why should I trust you?”: Explaining the pre- dictions of any classifier. InProceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demon- strations, pages 97–101, San Diego, California. As- sociation for Computational Linguistics. Avanti Shrikumar, Peyton Greenside, and Anshul Kun- daje. 2017. L...

work page 2016

[5] [5]

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

PMLR. Guido A Veldhuis, Dominique Blok, Maaike HT de Boer, Gino J Kalkman, Roos M Bakker, and Rob PM van Waas. 2024. From text to model: Lever- aging natural language processing for system dynam- ics model development.System Dynamics Review, 40(3):e1780. Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Al...

work page internal anchor Pith review arXiv 2024

[6] [6]

Rethinking cooperative rationalization: In- trospective extraction and complement control. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4094–4103. 10 Wei Zhou, Heike Adel, Hendrik Schuff, and Ngoc Thang Vu. 2024. Exp...

work page 2019