pith · machine review for the scientific record

arxiv: 2605.06295 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI · stat.ML

Recognition: unknown

Attributions All the Way Down? The Metagame of Interpretability

Fabian Fumagalli, Hubert Baniecki, Przemyslaw Biecek

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:59 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords model interpretability · meta-attributions · Shapley values · interaction indices · cooperative games · second-order effects · explainable AI · hierarchical decomposition
0 comments

The pith

Attributions decompose hierarchically into meta-attributions computed via Shapley values on the attribution process itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the metagame to quantify second-order effects within model explanations. It treats any first-order attribution method as a cooperative game whose Shapley value yields a directional meta-attribution measuring how feature j influences the importance assigned to feature i. The central theoretical result is that attributions decompose into these meta-attributions, which serve as directional extensions of standard interaction indices. A reader would care because the same machinery is then applied to token interactions in language models, cross-modal alignments in vision-language models, and concept formation in diffusion transformers.

Core claim

For any first-order attribution φ(f) of a model f, the meta-attribution φ_{j→i}(f) is defined by treating the attribution method as a cooperative game and computing its Shapley value; this yields the directional influence of feature j on the attribution of feature i. The paper proves that attributions hierarchically decompose into such meta-attributions and establishes them as directional extensions of existing interaction indices.

What carries the argument

The metagame: modeling an attribution method itself as a cooperative game so that its Shapley value produces directional meta-attributions φ_{j→i}(f) that decompose first-order explanations.
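The construction can be made concrete on a two-feature toy model with a multiplicative interaction, f(x) = x1 + x1·x2², echoing the paper's appendix example. The sketch below is editorial and hedged: it assumes the metagame's value function is the target feature's Shapley attribution recomputed under the coalition-restricted model with a zero baseline, which matches the simulated rebuttal's description but may differ in detail from the paper's Section 3.

```python
from itertools import combinations
from math import factorial

def shapley(v, players):
    """Exact Shapley values of a cooperative game v over `players`."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        acc = 0.0
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                acc += w * (v(frozenset(S) | {i}) - v(frozenset(S)))
        phi[i] = acc
    return phi

# Toy model with an interaction: f(x) = x1 + x1*x2^2 evaluated at x = (1, 2),
# zero baseline for absent features.
x = {1: 1.0, 2: 2.0}
features = [1, 2]
f = lambda S: (x[1] if 1 in S else 0.0) * (1.0 + (x[2] if 2 in S else 0.0) ** 2)

phi = shapley(f, features)        # first-order attributions: {1: 3.0, 2: 2.0}

# Metagame for target feature i: the value of coalition S is the attribution
# of i recomputed for the model restricted to S (assumed v(S); the paper's
# exact formulation may differ).
def meta(i):
    v = lambda S: shapley(lambda T: f(T & S), features)[i]
    return shapley(v, features)   # directional meta-attributions φ_{j→i}

meta_1 = meta(1)                  # {1: 2.0, 2: 1.0}: feature 2 carries 1.0
                                  # of feature 1's attribution of 3.0
```

The exact computation is exponential in the number of features; any practical use at LLM scale would need the usual sampling approximations.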

If this is right

  • Meta-attributions quantify token interactions inside instruction-tuned language models.
  • Meta-attributions explain cross-modal similarity inside vision-language encoders.
  • Meta-attributions interpret text-to-image concepts inside multimodal diffusion transformers.
  • First-order attributions decompose hierarchically into meta-attributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same metagame construction could be iterated to third-order meta-meta-attributions.
  • Practitioners might use meta-attributions to audit whether a chosen attribution method introduces its own systematic biases.
  • The directional character of meta-attributions may help distinguish symmetric from asymmetric feature influences in explanations.

Load-bearing premise

Treating any attribution method as a cooperative game and computing its Shapley value captures genuine directional influence of features on attributions without artifacts from the choice of value function or coalition structure.

What would settle it

In a simple model with known ground-truth directional feature interactions, the computed meta-attributions fail to recover those interactions.
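That test can be run in miniature: in a model where one feature is purely additive, a faithful meta-attribution should assign it exactly zero influence on another feature's attribution, while a genuinely interacting feature gets a nonzero share. A hedged sketch under the same assumed metagame value function as above (attribution of the target feature under the coalition-restricted model, zero baseline; the paper's exact v(S) may differ):

```python
from itertools import combinations
from math import factorial

def shapley(v, players):
    """Exact Shapley values of a cooperative game v."""
    n = len(players)
    out = {}
    for i in players:
        others = [p for p in players if p != i]
        acc = 0.0
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                acc += w * (v(frozenset(S) | {i}) - v(frozenset(S)))
        out[i] = acc
    return out

# Ground truth: features 1 and 2 interact; feature 3 is purely additive.
# f(x) = x1 + x1*x2 + x3 at x = (1, 1, 1), zero baseline.
features = [1, 2, 3]
f = lambda S: ((1.0 if 1 in S else 0.0) * (1.0 + (1.0 if 2 in S else 0.0))
               + (1.0 if 3 in S else 0.0))

# Meta-attributions into feature 1's attribution.
v = lambda S: shapley(lambda T: f(T & S), features)[1]
meta_into_1 = shapley(v, features)

# Recovery check: the non-interacting feature 3 should get exactly zero,
# the interacting feature 2 a strictly positive share.
```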

Figures

Figures reproduced from arXiv: 2605.06295 by Fabian Fumagalli, Hubert Baniecki, Przemyslaw Biecek.

Figure 1. Complementary interpretations of a simple transformer solving integer addition.

Figure 2. METAGAME quantifies gradient-based token interactions in vision-language encoders. Given a token attribution method (Grad-ECLIP) and a dual encoder (Meta CLIP-2), we compute meta-attributions from text token subsets and their corresponding visual patch attributions. First-order attributions quantify the effects that text tokens dog and yellow have on the similarity map (red, most similar). Directional meta…

Figure 3. METAGAME quantifies token interactions in instruction-tuned large language models. We compute Meta-AttnLRP as Shapley values from text tokens into AttnLRP token attributions of the Gemma language model’s generated output, highlighting directional second-order effects. We also measure the recall of detecting human-labeled interactions (e.g. word connotations, negation) on a sample of prompts spanning variou…

Figure 4. METAGAME quantifies concept interactions in multimodal diffusion transformers. Shapley values average attention across concept subsets, and interpret their directional dependencies.

Figure 5. Complementary interpretations of a simple transformer solving integer addition.

Figure 6. Complementary interpretations of a simple transformer solving integer addition.

Figure 7. An example of evaluating Meta-Grad-ECLIP interactions explaining MetaCLIP-2 on the fish-koala-balloon-laptop pointing game.

Figure 8. Reproduces the example of explaining MetaCLIP-2 with Meta-Grad-ECLIP from …

Figure 9. Example of synergies and antisynergies between text tokens hot, dog, eating on the attribution of image patches. The second-order effect between text token pair hot-dog and attribution of image patches can serve as a proxy for a tri-token interpretation of the model’s prediction.

Figure 10. METAGAME quantifies token interactions in instruction-tuned and pre-trained language models. Supplementary results extending …

Figure 11. Examples of quantifying token interactions in instruction-tuned language models.

Figure 12. Examples of quantifying token interactions in instruction-tuned language models.

Figure 13. Ablations extending …
Original abstract

We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution $\phi(f)$ explaining a model $f$, we measure the directional influence of feature $j$ on the attribution of feature $i$, denoted as meta-attribution $\varphi_{j \to i}(f)$, by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-attributions, and establish these as directional extensions of existing interaction indices. Empirically, we demonstrate that the metagame delivers insights across diverse interpretability applications: (i) quantifying token interactions in instruction-tuned language models, (ii) explaining cross-modal similarity in vision-language encoders, and (iii) interpreting text-to-image concepts in multimodal diffusion transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces the metagame framework for quantifying second-order interaction effects in model explanations. For any first-order attribution method φ(f), it defines meta-attributions φ_{j→i}(f) by treating the attribution method as a cooperative game and computing its Shapley value to capture the directional influence of feature j on the attribution of feature i. The central theoretical claims are that attributions hierarchically decompose into these meta-attributions and that the meta-attributions constitute directional extensions of existing interaction indices. The paper also reports empirical applications demonstrating insights into token interactions in instruction-tuned language models, cross-modal similarity in vision-language encoders, and text-to-image concepts in multimodal diffusion transformers.

Significance. If the theoretical claims hold with rigorous derivations, the metagame could offer a principled extension of Shapley-based methods to higher-order effects in interpretability, enabling more structured analysis of how features influence attributions themselves. The cross-modal empirical applications illustrate potential breadth, though the absence of quantitative metrics makes the practical significance harder to gauge at present.

major comments (1)
  1. [Abstract] The claim that 'attributions hierarchically decompose into meta-attributions' and constitute 'directional extensions of existing interaction indices' is asserted without any derivation steps, key lemmas, explicit definition of the value function v(S), or coalition structure for the metagame. This is load-bearing for the central theoretical contribution, as different choices of v(S) (e.g., marginal vs. average contribution) could produce different φ_{j→i} while leaving the original φ unchanged, introducing formulation-dependent artifacts that would break the claimed decomposition.
minor comments (1)
  1. [Empirical sections] The demonstrations across language models, vision-language encoders, and diffusion transformers are described qualitatively without reported quantitative results, error bars, baseline comparisons, or ablation studies on the metagame parameters.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful review and for highlighting the need for greater clarity on the theoretical foundations in the abstract. We address the major comment below and offer revisions to strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'attributions hierarchically decompose into meta-attributions' and constitute 'directional extensions of existing interaction indices' is asserted without any derivation steps, key lemmas, explicit definition of the value function v(S), or coalition structure for the metagame. This is load-bearing for the central theoretical contribution, as different choices of v(S) (e.g., marginal vs. average contribution) could produce different φ_{j→i} while leaving the original φ unchanged, introducing formulation-dependent artifacts that would break the claimed decomposition.

    Authors: We agree that the abstract, constrained by length, states the central claims at a summary level without derivations or explicit definitions. The full manuscript (Section 3) supplies these details: the metagame is defined as a cooperative game whose players are the input features; the value function v(S) is the attribution φ_i(f) of feature i under the model restricted to coalition S (with out-of-coalition features set to a baseline value); the coalition structure is the standard power set. The meta-attribution φ_{j→i}(f) is the Shapley value of player j in this game. Theorem 3.1 proves the hierarchical decomposition φ_i(f) = ∑_j φ_{j→i}(f) + baseline term. We further show that the construction yields directional extensions of standard interaction indices (e.g., it reduces to the pairwise interaction index of Grabisch et al. when symmetry is imposed). On the choice of v(S), our formulation uses the marginal contribution that is consistent with the original attribution method φ; this guarantees that the sum of meta-attributions recovers φ exactly, so no formulation-dependent artifacts arise. Alternative v(S) definitions (e.g., average rather than marginal) would generally break this recovery property, which is why we adopt the marginal version. We will revise the abstract to include a concise reference to the value function and the decomposition theorem. revision: yes
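The rebuttal's claimed identity, φ_i(f) = Σ_j φ_{j→i}(f) + baseline term, can be checked numerically under the stated assumptions (power-set coalitions, zero baseline, v(S) = attribution of i for the model restricted to S). The sketch below is an editorial reconstruction of that check, not the paper's code:

```python
from itertools import combinations
from math import factorial

def shapley(v, players):
    """Exact Shapley values of a cooperative game v."""
    n = len(players)
    out = {}
    for i in players:
        others = [p for p in players if p != i]
        acc = 0.0
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                acc += w * (v(frozenset(S) | {i}) - v(frozenset(S)))
        out[i] = acc
    return out

# Any model works for the identity check; here f(x) = x1 + x1*x2^2 at x=(1,2).
x = {1: 1.0, 2: 2.0}
features = [1, 2]
f = lambda S: (x[1] if 1 in S else 0.0) * (1.0 + (x[2] if 2 in S else 0.0) ** 2)

phi = shapley(f, features)  # first-order attributions

for i in features:
    v = lambda S, i=i: shapley(lambda T: f(T & S), features)[i]
    meta = shapley(v, features)     # φ_{j→i} for all j
    baseline_term = v(frozenset())  # v(∅); 0 here with a zero baseline
    # Claimed decomposition: φ_i = Σ_j φ_{j→i} + v(∅)
    assert abs(phi[i] - (sum(meta.values()) + baseline_term)) < 1e-9
```

Note that the identity is exactly Shapley efficiency for the metagame, which is the observation the circularity check below turns on.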

Circularity Check

1 steps flagged

Hierarchical decomposition of attributions into meta-attributions follows by construction from Shapley efficiency in the metagame definition

specific steps
  1. self definitional [Abstract]
    "For any first-order attribution φ(f) explaining a model f, we measure the directional influence of feature j on the attribution of feature i, denoted as meta-attribution ϕ_{j→i}(f), by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-attributions, and establish these as directional extensions of existing interaction indices."

    The decomposition is guaranteed by the efficiency property of Shapley values: the original attribution φ_i(f) equals the sum over j of the meta-attributions ϕ_{j→i}(f) by construction of the metagame definition. The 'proof' therefore reduces to restating a standard axiom of the chosen value function rather than deriving a new hierarchical property.

full rationale

The paper defines meta-attributions by applying Shapley values to the attribution method treated as a cooperative game. The claimed proof that attributions 'hierarchically decompose' into these meta-attributions is then a direct restatement of the efficiency axiom (sum of values equals total game value), which holds for any Shapley computation by definition. This makes the central theoretical result equivalent to the input definition rather than an independent derivation. No other circular patterns (self-citations, fitted predictions, or ansatzes) are evident from the provided text.
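The audit's point is checkable in general: efficiency (Σ_i Shapley_i = v(N) − v(∅)) holds for every cooperative game, so it holds in particular when v is built from an attribution method. A sketch with an arbitrary random game:

```python
import random
from itertools import combinations
from math import factorial

def shapley(v, players):
    """Exact Shapley values of a cooperative game v."""
    n = len(players)
    out = {}
    for i in players:
        others = [p for p in players if p != i]
        acc = 0.0
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                acc += w * (v(frozenset(S) | {i}) - v(frozenset(S)))
        out[i] = acc
    return out

# Efficiency is an axiom of the Shapley value, not a property of any
# particular game: it holds even for uniformly random coalition values.
random.seed(0)
players = (1, 2, 3)
table = {frozenset(S): random.random()
         for k in range(len(players) + 1)
         for S in combinations(players, k)}
v = lambda S: table[frozenset(S)]

phi = shapley(v, players)
assert abs(sum(phi.values()) - (v(frozenset(players)) - v(frozenset()))) < 1e-9
```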

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on applying cooperative game theory (Shapley values) to attribution methods and asserting a hierarchical decomposition that extends existing interaction indices.

axioms (1)
  • standard math Shapley value axioms hold when the attribution method is viewed as a cooperative game
    Invoked to define meta-attribution φ_{j→i}(f)
invented entities (2)
  • metagame no independent evidence
    purpose: Conceptual framework for second-order interaction effects of model explanations
    New term and structure introduced to organize meta-attributions
  • meta-attribution φ_{j→i}(f) no independent evidence
    purpose: Directional measure of feature j's influence on attribution of feature i
    Core new quantity defined via Shapley value on the attribution method

pith-pipeline@v0.9.0 · 5450 in / 1301 out tokens · 77561 ms · 2026-05-08T12:59:17.297349+00:00 · methodology

discussion (0)

