pith. sign in

arxiv: 2606.26474 · v1 · pith:45H253MJnew · submitted 2026-06-25 · 💻 cs.LG · cs.AI

Localizing RL-Induced Tool Use to a Single Crosscoder Feature

Pith reviewed 2026-06-26 05:39 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learningtool usecrosscodersfeature localizationlanguage modelsmechanistic interpretabilitycapability spilloveragentic behavior
0
0 comments X

The pith

Dedicated Feature Crosscoders isolate a compact set of RL-specific features that mediate tool-calling in Qwen2.5-3B and enable runtime behavioral control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that reinforcement learning changes the internal features of language models to support tool use, but these changes have been difficult to isolate. Dedicated Feature Crosscoders extract a small group of RL-introduced features through a sweep of 48 hyperparameter settings. Reconstructing the model from only those features raises tool-calling accuracy in the RL-tuned model and gives some of the new ability to the frozen base model. This concentration of capability into a minimal, steerable set suggests a route to controlling agentic behaviors directly at inference time.

Core claim

Dedicated Feature Crosscoders (DFC) isolate a compact set of RL-specific features that mediate tool-calling capability in Qwen2.5-3B. Across a 48-crosscoder hyperparameter sweep, encode-decode reconstruction improves the RL model's tool correctness by +31.1 ± 9.7 pp and passively transfers tool-calling ability to the frozen base model by +6.8 ± 5.0 pp which we call a capability spillover. Our findings show that DFC partitioning concentrates RL-introduced capability into a minimal, steerable feature set that enables runtime behavioral control of agentic LLMs.

What carries the argument

Dedicated Feature Crosscoders (DFC), a partitioning method that separates RL-introduced features from preserved base-model features to support targeted encode-decode reconstruction.

If this is right

  • Encode-decode reconstruction with DFC features raises tool correctness by 31.1 percentage points on average in the RL model.
  • Tool-calling ability transfers passively to the base model by 6.8 percentage points through capability spillover.
  • RL-introduced capability concentrates into a minimal set of features rather than spreading diffusely.
  • The isolated features support runtime behavioral control of agentic LLMs without further training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the features prove causal, targeted interventions on them could toggle tool use on or off at inference time.
  • The same partitioning approach might localize other RL-induced behaviors such as multi-step planning.
  • Runtime feature steering could reduce reliance on full retraining for adjusting agentic model behavior.

Load-bearing premise

The encode-decode reconstruction using the identified crosscoder features is assumed to causally mediate the observed tool-calling improvements rather than merely correlating with them or capturing unrelated variance.

What would settle it

An ablation or intervention experiment on the isolated features that fails to produce the expected change in tool-calling behavior while leaving other capabilities intact would falsify the mediation claim.

Figures

Figures reproduced from arXiv: 2606.26474 by Ahmed Zeyad A Alzahrani, Andrii Shportko, Bowen Cheng, Gustavo Mercier, Jessica Hullman, Shubham Bhokare.

Figure 1
Figure 1. Figure 1: Three architectures for joint sparse decomposition of paired model activations 2. Related Work Sparse autoencoders and mechanistic interpretability. SAEs are popular as a tool for decomposing LLM activa￾tions into interpretable features (Bricken et al., 2023; Cun￾ningham et al., 2023; Templeton et al., 2024; Elhage et al., 2022). These methods are built on the hypothesis that model representations are supe… view at source ↗
Figure 2
Figure 2. Figure 2: Decoder UMAP at matched hyperparameters (D = 8,192, k = 160; nneighbors = 30, min _dist = 0.1, cosine metric, seed 42). Left: the DFC (dfc-D8k-excl10-freeexcl-k160) yields three spatially distinct regions: A-exclusive (red), B-exclusive (blue), and shared (grey). Right: the CrossCoder (cc-D8k-k160) with a mass-ratio proxy partition of identical sizes (819/819/6,554) shows A-biased and B-biased features mix… view at source ↗
Figure 3
Figure 3. Figure 3: Saturation curve: ∆ tool-correctness vs |S| (log-x) for all conditions at layer 13. DFC A-excl plateaus from |S| = 1; CrossCoder needs |S| = 33 to reach its peak. B-excl steering has zero effect throughout. Devices used: dfc-D8k-excl10-k45 and cc-D8k-k45 Steering a single A-exclusive neuron (|S| = 1, α = 32) achieves +65.0 pp ∆ tool-correctness (95% CI 5 [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: % tool-correctness improvement per layer at |S| ≤ 10 (DFC A-excl, dfc-D8k-excl10-k45). Layers 18 and 24 are dead (no tool-signed A-excl features). Across-layers (fig. 4), the mean best-cell is ∆ = +43.3 pp (95% CI [+22.7, +64.0]), Cohen’s dz = 1.61, and paired t p = 0.0006, Wilcoxon p = 0.0078. Single-neuron satura￾tion is not a layer-13 artefact. 6.5. B-Exclusive Steering Has No Effect B-exclusive steerin… view at source ↗
Figure 6
Figure 6. Figure 6: Best ∆ vs α, pooled across DFC A-excl, shared, and CC. Moderate α (≈ 6–32) is consistently near-optimal. Over-steering (α = 64, large |S|) degrades performance. Key operating points: DFC A-excl at |S| = 1, α = 32 yields +65.0 pp; CC at |S| = 33, α = 6 yields +70.0 pp — 33× the neuron budget for a 5 pp gain. Over-steering (α = 64, |S| ≥ 9) degrades performance (e.g. |S| = 33, α = 64 → +0.0 pp). 11 [PITH_FU… view at source ↗
Figure 5
Figure 5. Figure 5: Tool-use behaviour before vs. after a single A-exclusive steering vector. Both panels show the same ToolRL-Qwen2.5-3B passed through the same DFC (dfc-D8k-excl10-k45, layer 13). Left: standard reconstruction — the model asks for clarification and never emits a <tool_call> (format_accuracy =False, tool_correctness =False). Right: the same model with an additive delta on the single highest-Cohen’s-d A-exclus… view at source ↗
Figure 7
Figure 7. Figure 7: Per-cell ∆ tool-correctness heatmaps (|S| × α) for the two representative DFCs. Sparse k = 45 peaks at targeted low-|S| steering; dense k = 160 benefits from broader coverage. E. Models and Repository Our code can be found at https://github.com/Antebe/model_diffing_crosscoders. The full sweep is now available at https://huggingface.co/collections/antebe1/qwen-toolrl-crosscoder. 13 [PITH_FULL_IMAGE:figures… view at source ↗
read the original abstract

Fine-tuning through RL reshapes the internal representations of language models to enable agentic behaviors such as tool use, yet the mechanistic basis of these changes remains poorly understood. While RL substantially improves structured tool-call generation, it is unclear which features emerge, which are preserved, and whether identified features can be leveraged for retraining-free behavioral control. In this work, we show that $\textit{Dedicated Feature Crosscoders (DFC)}$ isolate a compact set of RL-specific features that mediate tool-calling capability in $\texttt{Qwen2.5-3B}$. Across a $48$-crosscoder hyperparameter sweep, encode-decode reconstruction improves the RL model's tool correctness by $+31.1 \pm {9.7}$ pp and passively transfers tool-calling ability to the frozen base model by $+6.8 \pm 5.0$ pp which we call a $\textit{capability spillover}$. Our findings show that DFC partitioning concentrates RL-introduced capability into a minimal, steerable feature set that enables runtime behavioral control of agentic LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that Dedicated Feature Crosscoders (DFC) isolate a compact set of RL-specific features mediating tool-calling capability in Qwen2.5-3B. A 48-crosscoder hyperparameter sweep shows that encode-decode reconstruction using these features raises tool correctness by +31.1 ± 9.7 pp on the RL-tuned model and produces a +6.8 ± 5.0 pp spillover on the frozen base model, enabling runtime behavioral control without retraining.

Significance. If the mediation claim is substantiated, the work would offer a practical route to localize and steer RL-induced agentic capabilities via crosscoders, advancing mechanistic interpretability of post-training changes. The scale of the hyperparameter sweep and the reported spillover effect are concrete strengths that could support falsifiable follow-up experiments on feature necessity.

major comments (2)
  1. [Abstract] Abstract: The central claim that the DFC-isolated features 'mediate' tool-calling capability rests on encode-decode reconstruction gains alone. No ablation, feature-knockout, path-intervention, or necessity test in the original forward pass is referenced, leaving open whether the performance lift arises from the specific features, the reconstruction operator, or correlated but non-causal variance.
  2. [Abstract] Abstract: The reported improvements (+31.1 pp and +6.8 pp) are presented without description of baselines, control conditions, statistical tests, data exclusion criteria, or variance across seeds, making it impossible to assess whether the numbers support the causal mediation interpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments regarding the interpretation of our results as mediation and the completeness of experimental reporting in the abstract. We respond to each point below, indicating planned revisions where the manuscript can be strengthened without overstating the current evidence.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the DFC-isolated features 'mediate' tool-calling capability rests on encode-decode reconstruction gains alone. No ablation, feature-knockout, path-intervention, or necessity test in the original forward pass is referenced, leaving open whether the performance lift arises from the specific features, the reconstruction operator, or correlated but non-causal variance.

    Authors: We agree that the primary evidence consists of encode-decode reconstruction gains rather than direct necessity tests such as feature knockouts or path interventions performed in the unmodified forward pass of the original model. The 48-crosscoder sweep provides some robustness by demonstrating that gains are configuration-dependent and concentrated in particular DFC features. Nevertheless, this leaves open the possibility that the lift could partly reflect properties of the reconstruction operator itself. We will revise the abstract and discussion sections to replace 'mediate' with language such as 'localize' or 'are sufficient to recover' and to explicitly note the absence of forward-pass necessity tests as a limitation of the current work. revision: yes

  2. Referee: [Abstract] Abstract: The reported improvements (+31.1 pp and +6.8 pp) are presented without description of baselines, control conditions, statistical tests, data exclusion criteria, or variance across seeds, making it impossible to assess whether the numbers support the causal mediation interpretation.

    Authors: The reported ± values are the standard deviations computed across the 48 hyperparameter configurations in the sweep; this is the variability measure used in the study. The full manuscript describes the baseline (tool correctness of the unreconstructed RL model) and the control condition (spillover measured on the frozen base model). We will expand the abstract to briefly reference these baselines and controls. Data exclusion followed the standard protocol of the tool-use benchmark. However, the experiments did not include multiple random seeds independent of the hyperparameter sweep, so variance across seeds cannot be added; we will note this as a limitation in the revised manuscript. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical outcomes from hyperparameter sweep

full rationale

The paper reports results from a 48-crosscoder hyperparameter sweep in which encode-decode reconstruction using DFC features produces measured gains in tool correctness (+31.1 pp on the RL model, +6.8 pp spillover on the base model). These are presented as direct experimental outcomes rather than quantities derived from the inputs by definition, fitted parameters renamed as predictions, or load-bearing self-citations. No equations, uniqueness theorems, or ansatzes are shown that reduce the central claim to its own inputs; the derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no concrete free parameters, axioms, or invented entities; the DFC method itself appears introduced here but cannot be audited for fitting details or background assumptions.

pith-pipeline@v0.9.1-grok · 5737 in / 1061 out tokens · 43146 ms · 2026-06-26T05:39:49.636008+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 1 canonical work pages

  1. [1]

    Transformer Circuits Thread , year=

    Toy Models of Superposition , author=. Transformer Circuits Thread , year=

  2. [2]

    Transformer Circuits Thread , year=

    Towards Monosemanticity: Decomposing Language Models With Dictionary Learning , author=. Transformer Circuits Thread , year=

  3. [3]

    arXiv preprint arXiv:2309.08600 , year=

    Sparse Autoencoders Find Highly Interpretable Features in Language Models , author=. arXiv preprint arXiv:2309.08600 , year=

  4. [4]

    Scaling Monosemanticity: Extracting Interpretable Features from

    Templeton, Adly and Conerly, Tom and Marcus, Jonathan and others , journal=. Scaling Monosemanticity: Extracting Interpretable Features from. 2024 , url=

  5. [5]

    Transformer Circuits Thread , year=

    Sparse Crosscoders for Cross-Layer Features and Model Diffing , author=. Transformer Circuits Thread , year=

  6. [6]

    Mechanistic Interpretability Workshop at NeurIPS 2025 , year=

    Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs , author=. Mechanistic Interpretability Workshop at NeurIPS 2025 , year=

  7. [7]

    2025 , eprint=

    ToolRL: Reward is All Tool Learning Needs , author=. 2025 , eprint=

  8. [8]

    Acikgoz, Emrecan and others , year=

  9. [9]

    arXiv preprint arXiv:2412.15115 , year=

  10. [10]

    Advances in Neural Information Processing Systems , volume=

    Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , volume=

  11. [11]

    arXiv preprint arXiv:2502.04382 , year=

    Validation Methods for Neural Network Interpretability , author=. arXiv preprint arXiv:2502.04382 , year=

  12. [12]

    User-Assistant Bias in

    Pan, Xu and Fan, Jingxuan and Xiong, Zidi and Hahami, Ely and Overwiening, Jorin and Xie, Ziqian , journal =. User-Assistant Bias in

  13. [13]

    Locating and Editing Factual Associations in

    Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan , booktitle=. Locating and Editing Factual Associations in

  14. [14]

    Representation Engineering: A Top-Down Approach to

    Zou, Andy and Phan, Long and Chen, Sarah and others , journal=. Representation Engineering: A Top-Down Approach to

  15. [15]

    2025 , howpublished =

    Aranguri, Santiago , title =. 2025 , howpublished =

  16. [16]

    OpenAI Blog , year=

    Language Models Can Explain Neurons in Language Models , author=. OpenAI Blog , year=

  17. [17]

    Penedo, Guilherme and Kydl. The. Advances in Neural Information Processing Systems , year =

  18. [18]

    2026 , eprint=

    Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes , author=. 2026 , eprint=

  19. [19]

    Knowledge Boundary of Large Language Models: A Survey

    Li, Moxin and Zhao, Yong and Zhang, Wenxuan and Li, Shuaiyi and Xie, Wenya and Ng, See-Kiong and Chua, Tat-Seng and Deng, Yang. Knowledge Boundary of Large Language Models: A Survey. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.256