The Attribution Contract: Feature Attribution for Generative Language Models

Giang Nguyen

arxiv: 2605.23080 · v2 · pith:ZX442SPBnew · submitted 2026-05-21 · 💻 cs.LG

Pith reviewed 2026-05-25 05:29 UTC · model grok-4.3

classification 💻 cs.LG

keywords feature attribution · generative language models · autoregressive models · diffusion models · explainable AI · attribution contract · explanatory contracts

The pith

Generative language models require an explicit Attribution Contract to make feature attribution claims meaningful.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that feature attribution in generative language models suffers from ambiguity about what counts as a feature, since earlier tokens serve as both outputs and inputs while diffusion proceeds through iterative states rather than fixed sequences. This ambiguity is framed as a conceptual limitation of importing classifier-style methods, not a technical detail. The Attribution Contract is proposed to name five elements of any attribution claim so that the same algorithm can be seen to answer different questions under different assumptions. A sympathetic reader would care because the contract reframes many literature disagreements as mismatches in unstated premises rather than conflicts over algorithms themselves.

Core claim

We argue that this ambiguity is not merely an implementation detail, but a conceptual limitation of carrying classifier-era feature attribution directly into generative language modeling. We introduce the Attribution Contract, a specification for feature-attribution claims that names what output is being explained, which features are eligible to receive attribution, what generative process is assumed, what is held fixed, and what model score is being attributed. The contract clarifies why the same attribution method can answer different questions depending on how it is instantiated. We argue that many disagreements about feature attribution in generative language models are not disagreements about attribution algorithms, but about unstated explanatory contracts.

What carries the argument

The Attribution Contract, a five-element specification that defines the output explained, eligible features, generative process, fixed elements, and model score for any attribution claim.
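The five elements can be written down as a small record type. The following is a minimal sketch and not the paper's own notation: the class name `AttributionContract`, its field names, the helper `differing_elements`, and all string values are assumptions chosen here for illustration.

```python
from dataclasses import dataclass, fields
from typing import List, Tuple


@dataclass(frozen=True)
class AttributionContract:
    """Hypothetical record of the five elements the paper names."""
    output_explained: str               # what output is being explained
    eligible_features: Tuple[str, ...]  # which features may receive attribution
    generative_process: str             # what generative process is assumed
    held_fixed: Tuple[str, ...]         # what is held fixed
    model_score: str                    # what model score is being attributed


def differing_elements(a: AttributionContract, b: AttributionContract) -> List[str]:
    """Return the names of the contract elements on which a and b differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Two illustrative contracts; the string values are placeholders, not the paper's.
prompt_and_prefix = AttributionContract(
    output_explained="next token at one position",
    eligible_features=("prompt tokens", "generated prefix tokens"),
    generative_process="autoregressive",
    held_fixed=("model weights",),
    model_score="log-probability of the next token",
)
prompt_only = AttributionContract(
    output_explained="next token at one position",
    eligible_features=("prompt tokens",),
    generative_process="autoregressive",
    held_fixed=("model weights",),
    model_score="log-probability of the next token",
)

print(differing_elements(prompt_and_prefix, prompt_only))
```

Writing the contract as a frozen record makes it easy to attach to a reported attribution map and to compare two claims element by element.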

If this is right

The same attribution algorithm produces different insights depending on the chosen contract.
Attribution to earlier generated tokens is informative only when the contract treats those tokens as eligible features.
In diffusion models, local explanations can target intermediate denoising states when the contract so specifies.
Feature-attribution methods must be evaluated as method-contract pairs rather than in isolation.
Clarifying contracts makes apparent conflicts in the literature traceable to differing premises.
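The first and fourth points above can be made concrete with a toy example. This is a sketch under loud assumptions: `toy_generate` and `toy_score` are invented stand-ins for a language model, the tokens and weights are made up, and the attribution method is simple occlusion, not the Integrated Gradients method used in the paper's Figure 1.

```python
MASK = "<mask>"


def toy_generate(prompt):
    """Invented stand-in for decoding: the prefix depends on the prompt."""
    return ["Le", "chien", "est"] if "dog" in prompt else ["Le", "chat", "est"]


def toy_score(prompt, prefix):
    """Invented score for one target token, given prompt and generated prefix."""
    score = 0.0
    if "black" in prompt:
        score += 1.0
    if "chien" in prefix:
        score += 2.0
    if "est" in prefix:
        score += 0.5
    return score


def occlusion(tokens, score_fn):
    """Score drop when each token in turn is replaced by MASK."""
    base = score_fn(tokens)
    return [
        base - score_fn(tokens[:i] + [MASK] + tokens[i + 1:])
        for i in range(len(tokens))
    ]


prompt = ["dog", "black"]
prefix = toy_generate(prompt)
n = len(prompt)

# Contract A: prompt and prefix tokens are eligible; the prefix is held fixed
# as generated, so masking a prompt token does not change it.
attr_a = occlusion(prompt + prefix, lambda t: toy_score(t[:n], t[n:]))

# Contract B: only prompt tokens are eligible; the prefix is NOT held fixed
# and is regenerated from the perturbed prompt.
attr_b = occlusion(prompt, lambda p: toy_score(p, toy_generate(p)))

print(dict(zip(prompt + prefix, attr_a)))
print(dict(zip(prompt, attr_b)))
```

In this toy, the prompt token `dog` receives no attribution under contract A, because its influence flows entirely through the fixed prefix, and the largest attribution under contract B. The method is identical in both runs; only the contract changes.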

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Standardizing contract declarations in published work could make explanation results across papers directly comparable.
New attribution techniques could be developed that are optimized for particular contract choices rather than claimed to be contract-agnostic.
In applied settings such as debugging generated text, requiring contract statements upfront might reduce misinterpretation of which parts of the input the model actually relied on.

Load-bearing premise

That naming the five elements of the Attribution Contract resolves the conceptual limitation and that observed disagreements arise primarily from unstated contracts rather than algorithmic differences or evaluation choices.

What would settle it

A controlled comparison in which multiple papers explicitly declare their attribution contracts yet continue to produce incompatible conclusions about the same model and output would falsify the claim that unstated contracts are the main source of disagreement.
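The logic of this test can be sketched in a few lines. This is an illustration only: the function `classify_disagreement`, the dictionary keys, and the example values are hypothetical, and deciding whether two papers' conclusions actually agree is assumed to be done by a human reader.

```python
CONTRACT_ELEMENTS = (
    "output_explained",
    "eligible_features",
    "generative_process",
    "held_fixed",
    "model_score",
)


def classify_disagreement(contract_a, contract_b, conclusions_agree):
    """Label a pair of attribution claims by whether a contract mismatch
    could account for a disagreement between their conclusions."""
    mismatched = [k for k in CONTRACT_ELEMENTS if contract_a[k] != contract_b[k]]
    if conclusions_agree:
        return "no disagreement", mismatched
    if mismatched:
        return "possibly explained by contract mismatch", mismatched
    # Same declared contract, incompatible conclusions: the kind of case
    # that would count against the paper's central claim.
    return "not explained by contract mismatch", mismatched


# Hypothetical declared contracts from two papers studying the same model.
paper_1 = {
    "output_explained": "next token",
    "eligible_features": ("prompt", "prefix"),
    "generative_process": "autoregressive",
    "held_fixed": ("model weights",),
    "model_score": "log-probability",
}
paper_2 = dict(paper_1, eligible_features=("prompt",))

print(classify_disagreement(paper_1, paper_2, conclusions_agree=False))
print(classify_disagreement(paper_1, paper_1, conclusions_agree=False))
```

Only the last category bears on falsification, and a mismatch makes the contract explanation possible without proving it, which is why the first disagreement label is hedged.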

Figures

Figures reproduced from arXiv: 2605.23080 by Giang Nguyen.

**Figure 1.** The same attribution method produces different attribution maps under different settings. All three rows use the same model, prompt, generation, and attribution method (Integrated Gradients [13]). Only the setting differs. Top: a local next-token setting attributes the prediction of noir over both prompt and prefix tokens, and mass concentrates on the generated prefix Le chien est because it is predictive …

Original abstract

Feature attribution methods promise to identify which input features matter for a model output. In generative language models, however, it is often unclear what should count as a feature in the first place. In autoregressive language models, earlier generated tokens are both outputs of the model and inputs to later predictions. In diffusion language models, generation proceeds through iterative denoising or unmasking rather than fixed left-to-right prediction, so local explanation may target a state of diffusion rather than a next token. We argue that this ambiguity is not merely an implementation detail, but a conceptual limitation of carrying classifier-era feature attribution directly into generative language modeling. We introduce the Attribution Contract, a specification for feature-attribution claims that names what output is being explained, which features are eligible to receive attribution, what generative process is assumed, what is held fixed, and what model score is being attributed. The contract clarifies why the same attribution method can answer different questions depending on how it is instantiated. We argue that many disagreements about feature attribution in generative language models are not disagreements about attribution algorithms, but about unstated explanatory contracts. Using autoregressive and diffusion language models as case studies, we show when attribution to earlier generated tokens, intermediate states, or denoising stages is informative, when it is misleading, and why feature-attribution methods in generative language models should be evaluated as method-contract pairs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a useful five-part taxonomy for feature attribution in generative models but provides no evidence that unstated contracts drive most literature disagreements rather than baselines or metrics.

The Attribution Contract names five elements (what output is explained, which features can be attributed, the assumed generative process, what stays fixed, and the model score) to make explicit the choices that were left vague when moving from classifiers to autoregressive or diffusion language models. That framing is the main new piece. It correctly flags that tokens in autoregressive generation are both outputs and later inputs, and that diffusion steps involve states rather than fixed sequences, so standard attribution methods can target the wrong thing depending on the setup. The case studies described in the abstract illustrate when attributing to prior tokens or denoising stages is informative versus misleading, which is a clear way to show the ambiguity in practice. The paper does well at organizing existing practices into a shared specification without adding new algorithms or equations.

The central claim, however, is that many disagreements in the literature stem from unstated contracts rather than algorithmic differences or evaluation choices. The available text offers only illustrative examples, with no quantitative comparison or isolation of contract mismatch as the dominant factor. Without that, the proposal functions as a taxonomy rather than a demonstrated fix. The argument is conceptual and rests on the assumption that naming the elements will resolve the limitation, but the text does not test whether that naming actually reduces variance across methods.

This work is aimed at interpretability researchers working on generative language models who need a common language for their explanations. A reader already familiar with attribution methods in LLMs will find the distinctions useful for designing or critiquing experiments. It deserves peer review because the problem it identifies is real and the framework is straightforward, even though the paper would benefit from empirical checks on whether the contract accounts for observed disagreements.

Referee Report

2 major / 1 minor

Summary. The manuscript argues that feature attribution methods developed for classifiers encounter a conceptual limitation when applied to generative language models, due to ambiguities in what counts as a 'feature' (e.g., prior tokens serving as both outputs and inputs in autoregressive models, or intermediate diffusion states). It introduces the Attribution Contract—a five-element specification naming the output explained, eligible features, assumed generative process, what is held fixed, and the model score attributed—as a way to make implicit assumptions explicit. The central claim is that many literature disagreements concern unstated contracts rather than algorithms themselves, and that attribution methods should be evaluated as method-contract pairs, illustrated through case studies on autoregressive and diffusion language models.

Significance. If the framework is adopted, it could provide a useful taxonomy for clarifying explanatory assumptions in generative-model interpretability, encouraging more precise communication and evaluation. The emphasis on method-contract pairs offers a conceptual tool that might reduce certain classes of misinterpretation, though its significance is tempered by the absence of evidence that contract mismatch is the primary source of observed disagreements.

major comments (2)

[Abstract] The claim that 'many disagreements about feature attribution in generative language models are not disagreements about attribution algorithms, but about unstated explanatory contracts' is load-bearing for the paper's contribution, yet the manuscript provides no systematic analysis, citation count, or categorization of existing literature disagreements to demonstrate that contract mismatch dominates over algorithmic differences, baseline choices, or evaluation metrics.
[Case studies] The illustrative examples show scenarios where attribution to prior tokens or diffusion states is informative versus misleading, but contain no quantitative isolation (e.g., controlled ablation or variance decomposition) of contract specification effects from confounding factors such as normalization or baseline selection, leaving the assertion that the contract resolves the claimed conceptual limitation unsupported.

minor comments (1)

The five elements of the Attribution Contract are described narratively; formalizing them with explicit notation or pseudocode would improve precision and ease of application.
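As an illustration of what such notation could look like, one option is to treat a contract as a five-tuple and an attribution as a map defined relative to it. The symbols below are assumptions made for this note and do not come from the manuscript.

```latex
% Hypothetical notation, not taken from the manuscript.
% y: output being explained      F: set of features eligible for attribution
% G: assumed generative process  H: what is held fixed
% s: model score being attributed
\[
  C = (y,\; F,\; G,\; H,\; s),
  \qquad
  \phi_{A,C} : F \to \mathbb{R}
\]
% \phi_{A,C} is the attribution map that method A produces under contract C;
% the unit of evaluation is then the pair (A, C), not A alone.
```

Under this reading, two attribution maps $\phi_{A,C}$ and $\phi_{A,C'}$ with $C \neq C'$ need not be defined over the same feature set $F$, so a difference between them does not by itself indicate a failure of method $A$.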

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating planned revisions where appropriate.

Referee: [Abstract] The claim that 'many disagreements about feature attribution in generative language models are not disagreements about attribution algorithms, but about unstated explanatory contracts' is load-bearing for the paper's contribution, yet the manuscript provides no systematic analysis, citation count, or categorization of existing literature disagreements to demonstrate that contract mismatch dominates over algorithmic differences, baseline choices, or evaluation metrics.

Authors: We agree that the central claim would be strengthened by more explicit evidence from the literature. The manuscript's argument is primarily conceptual, using the Attribution Contract to identify sources of ambiguity and illustrating them via case studies. In revision we will add a targeted discussion (in the introduction or a new subsection) with specific citations to published attribution results on generative models where outcome differences align with differing implicit contracts (e.g., next-token vs. full-sequence explanation, or token vs. diffusion-state features) rather than algorithmic or baseline choices. A comprehensive citation count or exhaustive categorization remains outside the paper's scope, but concrete examples will be supplied. Revision: partial.

Referee: [Case studies] The illustrative examples show scenarios where attribution to prior tokens or diffusion states is informative versus misleading, but contain no quantitative isolation (e.g., controlled ablation or variance decomposition) of contract specification effects from confounding factors such as normalization or baseline selection, leaving the assertion that the contract resolves the claimed conceptual limitation unsupported.

Authors: The case studies are designed as qualitative illustrations of the conceptual issues the Attribution Contract addresses. We acknowledge that quantitative isolation of contract effects from other factors would offer additional support; however, defining and measuring 'conceptual resolution' via controlled ablations or variance decomposition is not straightforward and would require new metrics beyond the paper's conceptual contribution. The examples demonstrate when attributions become misleading under mismatched contracts and why method-contract pairs are the appropriate unit of evaluation. No quantitative experiments will be added in revision. Revision: no.

Circularity Check

0 steps flagged

No significant circularity: definitional framework with no equations, fitted predictions, or load-bearing self-citations.

The paper introduces the Attribution Contract as a five-element specification (output explained, eligible features, generative process, what is held fixed, model score attributed) to clarify feature attribution in generative language models. This is presented as a conceptual taxonomy rather than a mathematical derivation. No equations appear in the provided text that could reduce outputs to inputs by construction. The claim that disagreements arise from unstated contracts is argued via case studies on autoregressive and diffusion models, without statistical fitting or self-citation chains that force the result. The framework does not rename known results or smuggle ansatzes; it functions as an organizational proposal. Per the rules, this is a normal non-finding of circularity (score 0-2) for a self-contained definitional work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper introduces one new conceptual entity (the Attribution Contract) and relies on the domain assumption that specifying explanatory contracts addresses the identified conceptual limitation. No free parameters or additional invented entities are described.

axioms (1)

Domain assumption: Feature attribution methods become well-defined in generative models once the explanatory contract is explicitly stated.
This assumption underpins the claim that disagreements are about unstated contracts.

invented entities (1)

Attribution Contract (no independent evidence)
purpose: A specification that names output, features, generative process, fixed elements, and attributed score for any feature-attribution claim.
New conceptual object introduced to resolve ambiguity; no independent evidence outside the paper is provided in the abstract.

pith-pipeline@v0.9.0 · 5761 in / 1241 out tokens · 32376 ms · 2026-05-25T05:29:24.818655+00:00 · methodology
