pith. machine review for the scientific record.

arxiv: 2605.06979 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI · stat.ML


PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction

Arya Datla, Jonathn Chang, Ziv Goldfeld

Pith reviewed 2026-05-11 00:57 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords causal abstraction · optimal transport · mechanistic interpretability · neural networks · intervention analysis · localization

The pith

Optimal transport couplings between intervention effects localize causal variables in neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PLOT to locate the parts of a neural network that realize specific causal variables from a high-level model. It does this by solving an optimal transport problem that pairs the output changes caused by intervening on abstract variables with the changes caused by intervening on candidate neural sites. The resulting coupling gives a soft correspondence that can be turned into intervention handles, applied in stages from large regions to small groups of neurons, and used to initialize faster searches. This matters because exhaustive search over neural sites becomes too slow for big models, while transport matching stays fast and accurate.

Core claim

PLOT fits an optimal transport coupling between abstract variables and candidate neural sites based on the geometry of their intervention output effects, yielding a global soft correspondence that identifies the neural handles for causal variables and supports progressive refinement or guidance of other methods.

What carries the argument

Optimal transport coupling over intervention output geometries, which establishes soft alignments between abstract causal variables and neural sites.
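
A minimal sketch of that coupling step, not the paper's implementation: it assumes a squared-Euclidean cost between output-effect vectors, uniform marginals, and entropic (Sinkhorn) regularization; the effect matrices, cost choice, and ε here are illustrative stand-ins.

```python
# Minimal sketch of the coupling step, not the paper's implementation.
# Assumptions: squared-Euclidean cost between output-effect vectors,
# uniform marginals, entropic (Sinkhorn) regularization with eps.
import numpy as np

def sinkhorn_coupling(A, B, eps=0.1, iters=500):
    """A: (m, d) output-effect vectors for m abstract variables.
    B: (n, d) output-effect vectors for n candidate neural sites.
    Returns an (m, n) soft coupling with uniform marginals."""
    C = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # pairwise squared-L2 cost
    C = C / C.mean()                                    # scale so eps is comparable
    K = np.exp(-C / eps)                                # Gibbs kernel
    mu = np.full(A.shape[0], 1.0 / A.shape[0])
    nu = np.full(B.shape[0], 1.0 / B.shape[0])
    v = np.ones_like(nu)
    for _ in range(iters):                              # Sinkhorn fixed-point updates
        u = mu / (K @ v)
        v = nu / (K.T @ u)
    return u[:, None] * K * v[None, :]                  # pi = diag(u) K diag(v)

# Hypothetical usage: each row softly distributes an abstract variable's
# mass over the neural sites whose interventions move the output similarly.
rng = np.random.default_rng(0)
abstract_fx = rng.normal(size=(2, 8))   # e.g., two high-level variables
site_fx = rng.normal(size=(50, 8))      # e.g., fifty candidate sites
pi = sinkhorn_coupling(abstract_fx, site_fx)
hard_sites = pi.argmax(axis=1)          # hard reads of the soft alignment
```

Rows of the returned plan give each abstract variable a soft distribution over candidate sites, which is the correspondence PLOT then calibrates into intervention handles.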

If this is right

  • In simple cases a single coupling over neurons gives accurate handles quickly.
  • When guiding DAS, it reaches similar accuracy with much less computation.
  • Progressive localization from tokens or layers down to finer spans makes the approach work on larger networks (see the control-flow sketch after this list).
  • It offers a scalable way to perform causal abstraction analysis.
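
A control-flow sketch of that coarse-to-fine schedule, under stated assumptions: `refine`, `measure_effects`, and `couple` are hypothetical stand-ins, and keeping the top-k mass receivers per stage is one plausible pruning rule, not necessarily the paper's.

```python
# Control-flow sketch of the coarse-to-fine schedule (assumed, not the
# paper's code). All callables are hypothetical stand-ins.
import numpy as np

def progressive_localize(abstract_fx, coarse_sites, refine, measure_effects,
                         couple, depth=2, keep=4):
    """coarse_sites: initial candidates (e.g., token positions or layers).
    refine(site): finer sub-sites (e.g., coordinate groups or PCA spans).
    measure_effects(site): output-effect vector for intervening at site.
    couple(A, B): (m, n) soft coupling, e.g. the Sinkhorn sketch above."""
    candidates = list(coarse_sites)
    for stage in range(depth):
        site_fx = np.stack([measure_effects(s) for s in candidates])
        pi = couple(abstract_fx, site_fx)
        top = np.unique(np.argsort(-pi, axis=1)[:, :keep])  # top-k per variable
        survivors = [candidates[i] for i in top]
        if stage + 1 == depth:
            return survivors                                # final shortlist
        # expand only the survivors into the next, finer stage
        candidates = [sub for s in survivors for sub in refine(s)]
```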

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same transport idea could apply to aligning other kinds of high-level descriptions with network internals beyond causal models.
  • If it works reliably, researchers might use it to check proposed causal structures against actual network behavior more routinely.
  • It might inspire hybrid methods that combine transport with other alignment techniques for even better efficiency.

Load-bearing premise

That the geometry of output effects from abstract and neural interventions admits a meaningful optimal transport coupling corresponding to true causal variable alignments.

What would settle it

A test case with known correct neural sites where the transport coupling selects different sites and intervening there does not match the abstract model's counterfactual predictions.
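
One way to encode that test as a check; every helper name here is hypothetical, and the match threshold is an assumption, not a number from the paper.

```python
# Hedged encoding of the settling test. `pi` is a fitted coupling,
# `candidates` the sites it ranges over, `true_sites` the known-correct
# handles, and `match_rate` measures how often intervening at given sites
# reproduces the abstract model's counterfactual predictions.
def refutes_plot(pi, candidates, true_sites, match_rate, threshold=0.9):
    chosen = {candidates[i] for i in pi.argmax(axis=1)}
    picked_wrong_sites = chosen != set(true_sites)
    breaks_counterfactuals = match_rate(chosen) < threshold  # assumed cutoff
    # Refutation needs both: the coupling disagrees with ground truth AND
    # the sites it picked fail the counterfactual check.
    return picked_wrong_sites and breaks_counterfactuals
```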

Figures

Figures reproduced from arXiv: 2605.06979 by Arya Datla, Jonathn Chang, Ziv Goldfeld.

Figure 1: PLOT as a progressive localization engine. The diagram follows one high-level variable, Z2 in red, though OT localization is performed jointly over all high-level variables and candidate neural sites. PLOT first localizes coarse sites such as tokens/layers, then refines within them to coordinates or PCA spans. The resulting signal can be calibrated into a direct handle or used to guide DAS.
Figure 2: Optimal transport geometry and coupling. OT compares two probability distributions by the minimum cost of moving mass from one to the other. Given $\mu \in \mathcal{P}(\mathcal{X})$, $\nu \in \mathcal{P}(\mathcal{Y})$, and a cost function $c : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, the Kantorovich OT problem is (Villani, 2003; Peyré and Cuturi, 2019)
$$\mathrm{OT}_c(\mu, \nu) := \inf_{\pi \in \Pi(\mu, \nu)} \int c(x, y)\, d\pi(x, y),$$
where $\Pi(\mu, \nu) := \{\pi \in \mathcal{P}(\mathcal{X} \times \mathcal{Y}) : \pi(\cdot \times \mathcal{Y}) = \mu,\ \pi(\mathcal{X} \times \cdot) = \nu\}$ is the set of all couplings between $\mu$ and $\nu$.
Figure 4: HEQ comparison over 10 seeds. Values are mean ± standard deviation. Runtime excludes factual-model training and includes fitting, calibration, and test evaluation.
Figure 3: Causal model for HEQ.
Figure 5: HEQ site-level intervention handles learned by OT (before and after calibration) and DAS (in the canonical and rotated basis). The two high-level variables are shown side by side.
Figure 6: Causal model for 4-bit ripple-carry addition, with inputs A3A2A1A0 and B3B2B1B0.
Figure 7: Binary-addition comparison over 10 seeds. Values are mean ± standard deviation. Accuracy is averaged over the internal carry variables C1, C2, C3, and runtime excludes backbone training.
Figure 8: Binary-addition handles for d = 16. Rows show carries C1, C2, C3, while columns show post-calibration PLOT handles and PCA/DAS handles in their rotated bases.
Figure 9: MCQA causal model, with abstract variables AP and AT.
Figure 10: Additional HEQ diagnostics: sensitivity to entropic regularization and average learned intervention size.
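
The Figure 10 caption refers to an entropic regularization parameter $\varepsilon$ in the paper's equation (1), and the page notes performance is stable once $\varepsilon$ leaves the nearly unregularized regime. The exact form of (1) is not reproduced in this extract; the standard entropically regularized Kantorovich problem, which (1) presumably resembles, is

$$\mathrm{OT}_{c,\varepsilon}(\mu, \nu) := \inf_{\pi \in \Pi(\mu, \nu)} \int c(x, y)\, d\pi(x, y) \;+\; \varepsilon\, \mathrm{KL}\!\left(\pi \,\|\, \mu \otimes \nu\right),$$

which recovers the Figure 2 problem as $\varepsilon \to 0$ and admits fast Sinkhorn iterations (Cuturi, 2013) for moderate $\varepsilon$.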
read the original abstract

Causal abstraction offers a principled framework for mechanistic interpretability, aligning a high-level causal model with the low-level computation realized by a neural network through counterfactual intervention analysis. Existing methods such as distributed alignment search (DAS) learn expressive subspace interventions, but the relevant neural site is unknown a priori, so finding a handle requires a computationally burdensome search over candidate sites. We introduce PLOT (Progressive Localization via Optimal Transport), a transport-based framework that localizes causal variables from the output effect geometry of abstract and neural interventions. PLOT fits an optimal transport coupling between abstract variables and candidate neural sites, yielding a global soft correspondence that can be calibrated into intervention handles. In simple settings, a single coupling over individual neurons suffices. In larger models, PLOT is applied progressively, moving from coarse sites such as tokens, timesteps, or layers to finer supports such as coordinate groups or PCA spans, and optionally guiding DAS based on the localized signal. Across experiments of increasing complexity, transport-only PLOT handles are exceedingly fast and competitive on accuracy, while PLOT-guided DAS reaches DAS-level accuracy at a fraction of full DAS runtime, providing an efficient localization engine for causal abstraction research at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PLOT (Progressive Localization via Optimal Transport), a framework for localizing causal variables in neural networks by fitting optimal transport couplings between abstract variables and candidate neural sites based on the geometry of their intervention effects. It describes a progressive coarse-to-fine application from tokens/layers to finer supports and optional guidance for DAS, reporting that transport-only PLOT is fast and competitive in accuracy while PLOT-guided DAS achieves DAS-level performance at reduced runtime across experiments of increasing complexity.

Significance. If the OT-based couplings reliably recover true causal alignments rather than spurious effect matches, PLOT could provide a scalable localization engine that reduces the search burden in causal abstraction methods, enabling mechanistic interpretability at larger scales. The progressive schedule and DAS integration are practical strengths, but significance hinges on validation that effect-geometry transport corresponds to interchange-intervention equivalence.

major comments (2)
  1. [Section 4] Section 4 (Experiments): the reported competitive accuracy on experiments of increasing complexity does not specify whether the metric is strict interchange-intervention equivalence (IIE) with the high-level model or a proxy based solely on output-effect matching. This distinction is load-bearing for the central claim, as the OT coupling is constructed precisely to minimize effect-geometry distance and could succeed on proxies without ensuring causal fidelity when multiple sites induce similar perturbations.
  2. [Section 3.2] Section 3.2 (Progressive Localization): the coarse-to-fine schedule is presented without analysis or safeguards against early incorrect soft alignments propagating to finer supports (e.g., coordinate groups or PCA spans). If an initial token- or layer-level coupling selects a spurious site due to similar intervention effects, the subsequent refinement cannot recover the true causal variable.
minor comments (2)
  1. [Abstract] Abstract: the claim that 'transport-only PLOT handles are exceedingly fast' lacks quantitative runtime tables or speedup factors relative to exhaustive DAS search, reducing clarity on the efficiency contribution.
  2. [Method] Notation throughout: the cost function used for the OT plan (distance between effect vectors or distributions) is not explicitly defined with an equation, making it difficult to assess uniqueness or robustness to distributed representations.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and insightful comments, which help clarify the presentation of our results and the robustness of the progressive localization procedure. We address each major comment below and indicate the corresponding revisions.

read point-by-point responses
  1. Referee: [Section 4] Section 4 (Experiments): the reported competitive accuracy on experiments of increasing complexity does not specify whether the metric is strict interchange-intervention equivalence (IIE) with the high-level model or a proxy based solely on output-effect matching. This distinction is load-bearing for the central claim, as the OT coupling is constructed precisely to minimize effect-geometry distance and could succeed on proxies without ensuring causal fidelity when multiple sites induce similar perturbations.

    Authors: We agree that this distinction is important. The accuracy numbers reported in Section 4 are computed using the standard interchange-intervention accuracy metric from the causal abstraction literature (i.e., the fraction of test inputs on which the low-level model with the learned intervention matches the high-level model under interchange interventions). A paraphrased sketch of this metric appears after these responses. This is the same metric used to evaluate DAS and is directly tied to IIE rather than a pure output-effect proxy. Nevertheless, to make the evaluation fully transparent and to address the referee's concern about possible spurious effect matches, we will add an explicit statement of the metric in Section 4, include a short derivation showing why the OT objective is consistent with IIE under the linear-intervention assumption used in our experiments, and report an additional column of strict IIE success rates (exact match on all counterfactuals) for the main tables. These changes will appear in the revised manuscript.
    revision: yes

  2. Referee: [Section 3.2] Section 3.2 (Progressive Localization): the coarse-to-fine schedule is presented without analysis or safeguards against early incorrect soft alignments propagating to finer supports (e.g., coordinate groups or PCA spans). If an initial token- or layer-level coupling selects a spurious site due to similar intervention effects, the subsequent refinement cannot recover the true causal variable.

    Authors: This is a valid concern. The current manuscript presents the progressive schedule as a practical heuristic without a formal analysis of error propagation. In practice, the coarse-stage couplings are stable because intervention-effect geometries are more separable at the token/layer level, and the soft OT plan retains probability mass on multiple candidates that are then refined. To strengthen the paper, we will (i) add a paragraph in Section 3.2 discussing the conditions under which early-stage errors are unlikely (distinct effect geometries at coarse scales), (ii) include a small ablation that measures alignment stability across stages on the synthetic tasks, and (iii) describe an optional safeguard: retaining the top-k soft assignments from the coarse stage for parallel refinement at the fine stage. These additions will be incorporated in the revision.
    revision: yes
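
A paraphrase of the metric named in response 1, not the authors' code: `low_swap` and `high_swap` are hypothetical hooks that perform the interchange intervention in the network and in the high-level causal model, respectively.

```python
# Paraphrased sketch of interchange-intervention accuracy (our reading of
# response 1, not the authors' code). `low_swap` runs the network with the
# learned handle's value transplanted from the source run into the base run;
# `high_swap` performs the same interchange in the high-level causal model.
# `pairs` is a list of hypothetical (base, source) test inputs.
def interchange_accuracy(pairs, low_swap, high_swap):
    hits = sum(low_swap(base, src) == high_swap(base, src)
               for base, src in pairs)
    return hits / len(pairs)
```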

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper defines PLOT as a new transport-based localization procedure that computes an optimal transport coupling on intervention effect geometries and optionally guides DAS. This is a constructive definition of a method rather than a claim that reduces by construction to its inputs. No load-bearing steps invoke self-citations, fitted parameters renamed as predictions, or uniqueness theorems from prior author work. Empirical accuracy claims are presented as experimental outcomes on held-out or increasing-complexity tasks, not as tautological consequences of the fitting procedure itself. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters or axioms; the core assumption is that OT couplings on intervention geometries align with causal variables.

axioms (1)
  • domain assumption: Intervention effect geometries can be meaningfully coupled via optimal transport to recover causal variable locations. Central to the PLOT framework as stated in the abstract.

pith-pipeline@v0.9.0 · 5513 in / 1110 out tokens · 43712 ms · 2026-05-11T00:57:10.495200+00:00 · methodology



Reference graph

Works this paper leans on

20 extracted references · 16 canonical work pages · 5 internal anchors

  1. (Author and title not recovered from the extraction.) doi:10.1016/j.spa.2019.08.009.
  2. T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/monosemantic-features/index.html.
  3. Maheep Chaudhary and Atticus Geiger. Evaluating open-source sparse autoencoders on disentangling factual knowledge in GPT-2 small. CoRR, abs/2409.04478, 2024. doi:10.48550/arXiv.2409.04478. URL https://arxiv.org/abs/2409.04478.
  4. Patrick Cheridito and Stephan Eckstein. Optimal transport and Wasserstein distances for causal models. Bernoulli, 31(2):1351–1376, 2025. doi:10.3150/24-BEJ1773.
  5. Lénaïc Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Scaling algorithms for unbalanced transport problems. Mathematics of Computation, 87(314):2563–2609, 2018.
  6. Hoagy Cunningham, Aidan Ewart, Logan Riggs Smith, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. CoRR, abs/2309.08600, 2023. doi:10.48550/arXiv.2309.08600. URL https://arxiv.org/abs/2309.08600.
  7. Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, volume 26, 2013.
  8. Gemma Team. Gemma 2: Improving open language models at a practical size. CoRR, abs/2408.00118, 2024. doi:10.48550/arXiv.2408.00118. URL https://arxiv.org/abs/2408.00118.
  9. Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations, 2024.
  10. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 2014. URL https://arxiv.org/abs/1412.6980.
  11. Rémi Lassalle. Causal transport plans and their Monge–Kantorovich problems. Stochastic Analysis and Applications, 36(3):452–484, 2018. doi:10.1080/07362994.2017.1422747.
  12. Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. In First Conference on Language Modeling, 2024.
  13. Gregory Polyakov, Christian Hepting, Carsten Eickhoff, and Seyed Ali Bahrainian. Interpretability analysis of arithmetic in-context learning in large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1758–1777, 2025. doi:10.18653/v1/2025.emnlp-main.92. URL https://aclanthology.org/2025.emnlp-main.92/.
  14. Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar, and Atticus Geiger. HyperDAS: Towards automating mechanistic interpretability with hypernetworks. CoRR, abs/2503.10894, 2025. URL https://arxiv.org/abs/2503.10894.
  15. Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. Linear representations of sentiment in large language models. CoRR, abs/2310.15154, 2023. doi:10.48550/arXiv.2310.15154.
  16. Dor Tsur and Ziv Goldfeld. Neural entropic multi-marginal optimal transport and Gromov–Wasserstein alignment. CoRR, abs/2506.00573, 2025. doi:10.48550/arXiv.2506.00573. URL https://arxiv.org/abs/2506.00573.
  17. Cédric Villani. Topics in Optimal Transportation, volume 58 of Graduate Studies in Mathematics. American Mathematical Society, 2003.
  18. (Author and title not recovered from the extraction; IEEE Transactions on Information Theory, early access.) doi:10.1109/TIT.2026.3661439.
  19. Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, and Noah D. Goodman. Interpretability at scale: Identifying causal mechanisms in Alpaca. CoRR, abs/2305.08809, 2023. URL https://arxiv.org/abs/2305.08809.