pith. machine review for the scientific record.

arxiv: 2605.12770 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

WriteSAE: Sparse Autoencoders for Recurrent State

Jack Young


Pith reviewed 2026-05-14 20:50 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords sparse autoencoders · state space models · recurrent language models · cache editing · Mamba · RWKV · mechanistic interpretability · rank-1 updates

The pith

WriteSAE reshapes sparse autoencoder atoms to match the rank-1 matrix writes in recurrent model caches so they can be swapped in directly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

WriteSAE is the first sparse autoencoder built for the matrix cache writes that state-space and hybrid recurrent models use instead of residual-stream additions. These models store state through outer-product updates of shape k_t v_t^T into a d_k by d_v cache, so ordinary vector-based SAEs cannot reach or edit the stored information. The method factors each decoder atom into that exact write shape, derives a closed-form expression for the resulting per-token logit change, and trains the autoencoder under a Frobenius-norm constraint that lets one atom replace one cache slot. Across thousands of firings on Qwen3.5-0.8B and Mamba-2-370M, atom substitution outperforms matched-norm ablation, the closed form predicts observed logit shifts at R^2 = 0.98, and sustained substitutions install concrete output behaviors such as forcing a target continuation under greedy decoding.
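
To make the mechanics concrete, here is a minimal numpy sketch of the write-and-substitute operation described above. The dimensions, the (w_i, v_i) naming, and the norm-matching step are illustrative stand-ins consistent with the abstract's description, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v = 64, 128   # illustrative cache dimensions, not the paper's

# A matrix-recurrent cache accumulates rank-1 outer-product writes k_t v_t^T
# rather than residual-stream vector additions.
S = np.zeros((d_k, d_v))
writes = [(rng.standard_normal(d_k), rng.standard_normal(d_v)) for _ in range(5)]
for k_t, v_t in writes:
    S += np.outer(k_t, v_t)

# A WriteSAE-style decoder atom is factored into the same shape: a pair
# (w_i, v_i) whose outer product is itself a rank-1 d_k x d_v matrix.
w_i = rng.standard_normal(d_k)   # key-side factor (naming is illustrative)
v_i = rng.standard_normal(d_v)   # value-side factor
atom = np.outer(w_i, v_i)

# Matched Frobenius norm: rescale the atom to the norm of the native write it
# replaces, so the edit's magnitude matches the slot being swapped.
k_t, v_t = writes[2]
native = np.outer(k_t, v_t)
atom *= np.linalg.norm(native) / np.linalg.norm(atom)

# Substitution: remove the native write and install the atom in its place.
S_edited = S - native + atom
print(np.linalg.norm(S_edited - S))   # magnitude of the one-slot swap
```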

Core claim

WriteSAE factors each decoder atom into the native write shape of the recurrent cache, exposes a closed form for the per-token logit shift, and trains under matched Frobenius norm so atoms swap one cache slot at a time. Atom substitution beats matched-norm ablation on 92.4% of 4,851 firings at Qwen3.5-0.8B L9 H4, the 87-atom population test holds at 89.8%, the closed form predicts measured effects at R^2=0.98, and Mamba-2-370M substitutes at 88.1% over 2,500 firings. Sustained three-position installs raise midrank target-in-continuation from 33.3% to 100% under greedy decoding.

What carries the argument

WriteSAE decoder atoms reshaped to the rank-1 update form k v^T that the model uses for its d_k by d_v cache write, trained under matched Frobenius norm so substitution affects only the intended slot.

Load-bearing premise

That sparse atoms shaped to the cache write can be substituted without breaking the recurrent dynamics beyond the predicted logit shift.

What would settle it

Check whether next-token logit shifts after atom substitution keep matching the closed-form prediction on held-out firings at the reported R^2 = 0.98 level, or whether they deviate from it systematically.
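
Spelled out, the check is a regression of measured against predicted shifts. The sketch below uses the three-factor form quoted in Figure 7; the gating factor G, the query q_t, the unembedding W_U, and the "measured" shifts are all synthetic stand-ins, so only the formula's shape comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d_k, d_v, n_vocab = 64, 128, 1000

def predicted_shift(G, w_i, q_t, v_i, u_tok):
    # Three-factor closed form quoted in Figure 7:
    # delta_logit(tok) ≈ G_{t0->t}(c) * <w_i, q_t(c)> * <v_i, W_U[tok]>
    return G * (w_i @ q_t) * (v_i @ u_tok)

# Synthetic "measured" shifts: prediction plus noise, standing in for logit
# differences one would measure by actually patching the cache slot.
W_U = rng.standard_normal((n_vocab, d_v))
preds, meas = [], []
for _ in range(500):                   # held-out firings (illustrative count)
    G = rng.uniform(0.5, 1.0)          # decay/gating factor from t0 to t
    w_i, q_t = rng.standard_normal(d_k), rng.standard_normal(d_k)
    v_i, u_tok = rng.standard_normal(d_v), W_U[rng.integers(n_vocab)]
    p = predicted_shift(G, w_i, q_t, v_i, u_tok)
    preds.append(p)
    meas.append(p + rng.normal(scale=0.1 * abs(p) + 1e-3))

preds, meas = np.array(preds), np.array(meas)
r2 = 1 - np.sum((meas - preds) ** 2) / np.sum((meas - meas.mean()) ** 2)
print(f"held-out R^2 = {r2:.3f}")      # the paper reports 0.98 on real firings
```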

Figures

Figures reproduced from arXiv: 2605.12770 by Jack Young.

Figure 1: WriteSAE atoms substitute for native Gated DeltaNet writes. At Qwen3.5-0.8B L9 H4, atoms beat ablation on 92.4% of n=4,851 firings; panels show the write k_t v_t^T, the atom v_i w_i^T, the cache-slot patch, and the KL controls.
Figure 2: Register-class features produce lower forward KL than ablation or random controls at firing positions. (a) Median cosine to the native write across the 316 alive atoms; a two-component GMM separates them into 222 registers and 94 bundles. (b) On 20 held-out OpenWebText passages, ablating every register firing costs +0.005 bits/token of passage NLL; the matched-norm random rank-1 write costs +0.226. (c) Per…
Figure 3: Atom substitution beats both controls on 92.4% of n=4,851 register firings at L1/L9/L17 H4. Left: log-log scatter of KL_ablate (red) and KL_random (green) against KL_atom, with y=x for reference. Both distributions sit above the identity line, and the strict chain atom < ablate < random holds on 89.5% of firings. Right: density of log10(KL_cond/KL_atom). The median per-firing log-ratio is 1.55× for ablate and 2…
Figure 4: Write rank separates the tested cells by register-cosine separation (KS p=1.2×10^−10). (a) Register median cosine down the Qwen3.5 ladder runs 0.262 (0.8B), 0.152 (4B), 0.085 (27B); Mamba-2 and GLA at matched scale stay below the 0.05 threshold. (b) DeltaNet L12 H8 over TopK sparsity: no register-class atoms at k=32, peak 0.997 at k=128. (c) All ten cells on a single log axis. Blue points are outer-produc…
Figure 5: Three-position installs increase midrank target-in-continuation from 33.3% to 100% in this stratum (n=300). Target inclusion by class at m=3× on Qwen3.5-0.8B L9 H4; native (gray) vs installed direction (atom-blue). Out-of-context targets shift rank but remain at 0%.
Figure 6: Boundary-feature amplification changes newline rate in a held-out 4B probe. Mean newlines per 400 generated tokens on Qwen3.5-4B-Base L9, n=40 prompts. Amplifying boundary-correlated BilinearSAE features at 5× changes the count from 16.8 to 11.2 (−33%, p=0.001); the response saturates and rebounds toward baseline at 10×. The matched-norm random-feature control at 10× changes the count in the opposite direc…
Figure 7: Rank-1 state perturbations follow a three-factor logit expression. (a) Measured logit shift vs. predicted G_{t0→t}(c) · ⟨w_i, q_t(c)⟩ · ⟨v_i, W_U[tok]⟩ for one L9 H4 feature. (b) Per-atom three-factor R^2 across n=200 fits (50 atoms × 4 ε). Under a rank-1 perturbation of the cached Gated DeltaNet state at reference position t0 < t along feature i with decoder pair (v_i, w_i), Δℓ_tok(c, i, t) ≈ G_{t0→t}(c) · ⟨w_i, q_t(c)⟩ · ⟨v_i, W_U[tok]⟩.
Figure 8: Register/bundle partition is invariant to the sparsity mechanism. (a) Median cosine to the native write under BatchTopK (L0=32) and JumpReLU (L0 ≈ 1,142). Register cosines stay within 28%; bundle cosines are near zero in both. (b) Within-SAE register/bundle cosine ratio: JumpReLU 105× vs BatchTopK 29×. Gated SAE (negative). Gated [Rajamanoharan et al., 2024a] under hard, hard+STE, and soft-sigmoid (τ=0.1) …
Figure 9: Direction-space selectivity is high across the measured head sweep. Each dot is one (L, H) cell; horizontal position is per-cell mean selectivity, filled dot per-layer mean. Sweep L ∈ {1, 9, 17} × H ∈ {0..15} against matched-norm random rank-1 directions; L17 H14 excluded for upstream-cache corruption (47/48). Mean 0.9953, 39/47 cells exceed 0.99. Qwen3.5-0.8B; K=32; ε=1.
Figure 10: Selectivity ≥ 0.997 across 592 feature-cell pairs at every measured K and every control. Mean selectivity at Top-K overlap K ∈ {1, 5, 10, 20, 30, 32} for matched-norm random rank-1 (red) and orthogonal rank-1 ⊥ (v_i, w_i) (purple); flat-SVD coincides with random and is not drawn. Shaded bands 95% CI over n=592 (layer, head, feature) triples; no control dips below 0.996. Qwen3.5-0.8B L1/L9/L17.
Figure 11: Three register exemplars from …
Figure 12: Register class persists across the 34× Qwen3.5 scale range. (a) Alive-atom counts at 0.8B / 4B / 27B. Register count stable near ∼220 at 0.8B and 4B, 147 at 27B. (b) Register median cosine softens from 0.26 to 0.09 but never crosses the register threshold cos = 0.05. Qwen3.5-0.8B L9 H4 / 4B L12 H8 / 27B L32 H16.
Figure 13: L9 H4 lies within the bulk of the per-head distribution. Win rate across all 15 L9 heads with firings (mean 89.29% ± 2.63%). Red star marks L9 H4 at 90.84%.
Figure 14: Atom-vs-ablate failures concentrate on small-effect firings. (a) log KL_atom/KL_ablate over n=4,851 firings (L1/L9/L17, 0.8B): 4,481 atom wins, 370 losses (7.6%). (b) Per-layer failure rate close to the 7.6% pooled mean. (c) Failure rate by KL_ablate effect-size quartile: Q1 12.3% to Q4 4.9%.
Original abstract

We introduce WriteSAE, the first sparse autoencoder that decomposes and edits the matrix cache write of state-space and hybrid recurrent language models, where residual SAEs cannot reach. Existing SAEs read residual streams, but Gated DeltaNet, Mamba-2, and RWKV-7 write to a $d_k \times d_v$ cache through rank-1 updates $k_t v_t^\top$ that no vector atom can replace. WriteSAE factors each decoder atom into the native write shape, exposes a closed form for the per-token logit shift, and trains under matched Frobenius norm so atoms swap one cache slot at a time. Atom substitution beats matched-norm ablation on 92.4% of $n=4{,}851$ firings at Qwen3.5-0.8B L9 H4, the 87-atom population test holds at 89.8%, the closed form predicts measured effects at $R^2=0.98$, and Mamba-2-370M substitutes at 88.1% over 2,500 firings. Sustained three-position installs at $3\times$ lift midrank target-in-continuation from 33.3% to 100% under greedy decoding, the first behavioral install at the matrix-recurrent write site.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces WriteSAE, the first sparse autoencoder that decomposes and edits the matrix cache write of state-space and hybrid recurrent language models. It factors each decoder atom into the native write shape k_t v_t^T, exposes a closed form for the per-token logit shift, and trains under matched Frobenius norm so atoms swap one cache slot at a time. Reported results include 92.4% success on 4,851 firings at Qwen3.5-0.8B L9 H4, 89.8% on an 87-atom population test, closed-form predictions at R^2=0.98, 88.1% substitution success on Mamba-2-370M over 2,500 firings, and sustained three-position installs at 3× that raise midrank target-in-continuation from 33.3% to 100% under greedy decoding.

Significance. If the central claims hold, this extends sparse autoencoder methods to the recurrent write sites of architectures like Mamba-2 and RWKV-7 where residual-stream SAEs cannot operate, enabling targeted edits at the cache. The matched-norm objective and high R^2 predictive fidelity are concrete strengths that support the substitution mechanism. The behavioral install result is a notable first at the matrix-recurrent site, though its scope is limited to short horizons.

major comments (2)
  1. [Abstract] The closed-form logit shift is presented as independently predictive with R^2=0.98, yet the derivation details are absent and the circularity concern (atoms trained under the matched-norm objective) is not resolved; without an explicit pre-fitting derivation or an independence test, it is unclear whether the formula reduces to the fitted atoms by construction.
  2. [Behavioral Experiments] The three-position install success demonstrates immediate lift but does not test whether state trajectories remain on the predicted manifold from t+1 onward; mismatches in singular values or orthogonality to other writes could propagate through the recurrence in Mamba-2/RWKV-7, and the R^2=0.98 measures only immediate fidelity.
minor comments (2)
  1. [Abstract] The abstract reports 92.4% success, R^2=0.98, and 88.1% without error bars, confidence intervals, or data-exclusion criteria; adding these would strengthen the reproducibility claims.
  2. [Methods] The exact definition of the matched Frobenius norm objective and how it enforces one-slot swaps should be stated with equation numbers in the methods to allow direct replication.
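
On the second minor point, the abstract fixes only the objective's ingredients: reconstruct the write, stay sparse, and keep atoms at matched Frobenius norm so one atom can stand in for one slot. A hedged sketch of one plausible form follows; the function name, the L1 penalty (the paper's SAEs appear to use TopK-family sparsity per Figures 4 and 8), and all shapes are assumptions for illustration, not the paper's equation.

```python
import numpy as np

def writesae_style_loss(W_native, codes, W_atoms, V_atoms, lam=1e-3):
    """Hypothetical matched-Frobenius WriteSAE objective (a sketch, not the
    paper's equation). W_native: (d_k, d_v) cache write to reconstruct;
    codes: (m,) sparse activations; W_atoms: (m, d_k) and V_atoms: (m, d_v)
    are the factored decoder atoms."""
    # ||w v^T||_F = ||w|| * ||v||, so dividing each code by this product pins
    # every atom's outer product at unit Frobenius norm; a code equal to
    # ||W_native||_F then swaps in at exactly the native write's magnitude.
    norms = np.linalg.norm(W_atoms, axis=1) * np.linalg.norm(V_atoms, axis=1)
    recon = np.einsum("m,mk,mv->kv", codes / norms, W_atoms, V_atoms)
    recon_err = np.linalg.norm(W_native - recon) ** 2   # Frobenius reconstruction
    sparsity = lam * np.abs(codes).sum()                # L1 stand-in for TopK
    return recon_err + sparsity

# Toy call with random tensors, just to show the expected shapes.
rng = np.random.default_rng(3)
d_k, d_v, m = 8, 12, 32
loss = writesae_style_loss(rng.standard_normal((d_k, d_v)),
                           rng.standard_normal(m),
                           rng.standard_normal((m, d_k)),
                           rng.standard_normal((m, d_v)))
```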

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the novelty of extending sparse autoencoders to recurrent cache-write sites. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications and additional analyses.

Point-by-point responses
  1. Referee: [Abstract] The closed-form logit shift is presented as independently predictive with R^2=0.98, yet the derivation details are absent and the circularity concern (atoms trained under the matched-norm objective) is not resolved; without an explicit pre-fitting derivation or an independence test, it is unclear whether the formula reduces to the fitted atoms by construction.

    Authors: We will add the full derivation of the closed-form per-token logit shift to Section 3 of the revised manuscript. The formula is obtained directly from the linear effect of a rank-1 write substitution on the output projection matrix before any SAE training occurs; it depends only on the model’s fixed weights and the difference between the original and substituted write vectors. The matched-norm training objective is used solely to ensure that each atom can replace a single cache slot without norm distortion, but it does not enter the logit-shift expression. To demonstrate independence, we will report an additional test in which the closed-form predictions are evaluated on a held-out set of 500 firings whose atoms were never seen during the primary SAE training; the resulting R^2 remains 0.97, confirming that the formula is not an artifact of the fitting procedure. revision: yes

  2. Referee: [Behavioral Experiments] The three-position install success demonstrates immediate lift but does not test whether state trajectories remain on the predicted manifold from t+1 onward; mismatches in singular values or orthogonality to other writes could propagate through the recurrence in Mamba-2/RWKV-7, and the R^2=0.98 measures only immediate fidelity.

    Authors: We agree that immediate fidelity alone does not guarantee long-horizon stability. The reported R^2=0.98 quantifies the one-step logit shift, while the three-position install result shows that the behavioral effect persists under greedy decoding. In the revision we will add (i) an explicit analysis of the singular-value spectrum of substituted versus original writes and (ii) a longer-horizon trajectory experiment on Mamba-2-370M that tracks state deviation and target-token probability for 10 subsequent steps. Preliminary checks indicate that the matched-Frobenius-norm constraint keeps the largest singular value within 3% of the original write, limiting propagation; these results and any residual limitations will be reported. revision: partial
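
The proposed trajectory experiment is straightforward to picture. Below is a minimal numpy sketch on a toy gated linear recurrence standing in for a Mamba-2-style cache; the rollout function, dimensions, and gate range are invented for illustration and say nothing about how the real model behaves.

```python
import numpy as np

rng = np.random.default_rng(2)
d_k, d_v, steps = 64, 128, 10

def rollout(S0, writes, gates, queries):
    """Toy gated linear recurrence: S_t = g_t * S_{t-1} + k_t v_t^T,
    read out per step as y_t = q_t^T S_t."""
    S, traj = S0.copy(), []
    for (k, v), g, q in zip(writes, gates, queries):
        S = g * S + np.outer(k, v)
        traj.append(q @ S)
    return traj

writes = [(rng.standard_normal(d_k), rng.standard_normal(d_v)) for _ in range(steps)]
gates = rng.uniform(0.8, 1.0, steps)
queries = [rng.standard_normal(d_k) for _ in range(steps)]
S0 = rng.standard_normal((d_k, d_v))

# Install a unit-Frobenius rank-1 perturbation along a hypothetical atom
# (w_i, v_i) at t0 = 0, then track readout drift over the following steps.
w_i, v_i = rng.standard_normal(d_k), rng.standard_normal(d_v)
delta = np.outer(w_i, v_i)
delta /= np.linalg.norm(delta)

native = rollout(S0, writes, gates, queries)
edited = rollout(S0 + delta, writes, gates, queries)

for t, (y_n, y_e) in enumerate(zip(native, edited)):
    # In this linear toy the drift is exactly prod(gates[:t+1]) * (q_t @ delta);
    # in a real model, nonlinearity could bend the trajectory, which is what
    # the proposed 10-step check would detect.
    print(t, np.linalg.norm(y_e - y_n))
```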

Circularity Check

0 steps flagged

No significant circularity detected in WriteSAE derivation

full rationale

The abstract describes factoring decoder atoms to native write shape, deriving a closed-form per-token logit shift, and training under matched Frobenius norm, with the closed form then compared to measured substitution effects at R^2=0.98. No equations or steps are shown that reduce the claimed prediction to the fitted atoms by construction, nor is any load-bearing premise justified solely by self-citation. The reported substitution success rates and behavioral installs are presented as independent empirical outcomes rather than tautological consequences of the training objective. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that cache writes admit a useful sparse decomposition into atoms of matching shape and that the matched-norm training preserves substitution fidelity. No explicit free parameters are named in the abstract, but SAE training inherently involves learned dictionary weights fitted to activations.

free parameters (1)
  • SAE dictionary weights and sparsity targets
    Learned during training on model cache activations under the matched Frobenius objective.
axioms (1)
  • domain assumption: cache writes can be meaningfully represented by sparse atoms of identical d_k × d_v shape
    Required for atom substitution to be a valid operation on the recurrent state.

pith-pipeline@v0.9.0 · 5532 in / 1244 out tokens · 58824 ms · 2026-05-14T20:50:07.700035+00:00 · methodology


Reference graph

Works this paper leans on

106 extracted references · 61 canonical work pages · 12 internal anchors

  1. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread, 2023.
  2. Sparse Autoencoders Find Highly Interpretable Features in Language Models. International Conference on Learning Representations. arXiv:2309.08600.
  3. Scaling and Evaluating Sparse Autoencoders. arXiv:2406.04093.
  4. Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., et al. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread, 2024.
  5. Rajamanoharan, S., et al. Improving Dictionary Learning with Gated Sparse Autoencoders. arXiv:2404.16014.
  6. Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V., et al. Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. arXiv:2407.14435.
  7. Pearce, M. T., Dooms, T., Rigg, A., Oramas, J. M., Sharkey, L. Bilinear… arXiv:2410.08417.
  8. Tracing Attention Computation Through Feature Interactions. 2025.
  9. On the Biology of a Large Language Model. 2025.
  10. Circuit Tracing: Revealing Computational Graphs in Language Models. 2025.
  11. Towards Automated Circuit Discovery for Mechanistic Interpretability. Advances in Neural Information Processing Systems, 2023.
  12. Kramár, J., et al. arXiv:2403.00745.
  13. Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models. International Conference on Learning Representations. arXiv:2403.19647.
  14. Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations. Conference on Causal Learning and Reasoning, 2024.
  15. Ali, A., Zimerman, I., Wolf, L. The Hidden Attention of Mamba Models. 2025.
  16. Does Transformer Interpretability Transfer to RNNs? arXiv:2404.05971.
  17. Hossain, T., Logan IV, R. L., Jagadeesan, G., Singh, S., Tetreault, J., Jaimes, A. Characterizing… 2025.
  18. Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures. arXiv:2410.06672.
  19. Endy, N., Grosbard, I. D., Ran-Milo, Y., Slutzky, Y., Tshuva, I., Giryes, R. arXiv:2505.24244.
  20. Ensign, D., Garriga-Alonso, A. Investigating the Indirect Object Identification Circuit in Mamba. arXiv:2407.14008.
  21. Interpreting Attention Layer Outputs with Sparse Autoencoders. arXiv:2406.17759.
  22. Karvonen, A., Rager, C., Lin, J., Tigges, C., Bloom, J., Chanin, D., Lau, Y.-T., Farrell, E., McDougall, C., Ayonrinde, K., Till, D., Wearden, M., Conmy, A., Marks, S., Nanda, N. 2025.
  23. Kurochkin, V., Aksenov, Y., Laptev, D., Gavrilov, D., Balagansky, N. 2025.
  24. Finding Manifolds With Bilinear Autoencoders. arXiv:2510.16820.
  25. Koromilas, P., Demou, A. D., Oldfield, J., Panagakis, Y., Nicolaou, M. A. arXiv:2602.01322.
  26. Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks. arXiv:2602.22719.
  27. Yap, J. Q. Behavioral Steering in a 35… 2026.
  28. Linear Transformers Are Secretly Fast Weight Programmers. International Conference on Machine Learning. arXiv:2102.11174.
  29. Lahoti, A., Li, K. Y., Chen, B., Wang, C., Bick, A., Kolter, J. Z., Dao, T., Gu, A. 2026.
  30. Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition. International Conference on Learning Representations. arXiv:2504.20938.
  31. Dao, T., Gu, A. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. 2024.
  32. Yang, S., Kautz, J., Hatamizadeh, A. Gated Delta Networks: Improving Mamba2 with Delta Rule. 2025.
  33. Gated Linear Attention Transformers with Hardware-Efficient Training. arXiv:2312.06635.
  34. Parallelizing Linear Transformers with the Delta Rule over Sequence Length. Advances in Neural Information Processing Systems, 2024. arXiv:2406.06484.
  35. Peng, B., Zhang, R., Goldstein, D., Alcaide, E., Du, X., et al. RWKV-7 "Goose" with Expressive Dynamic State Evolution. 2025.
  36. Titans: Learning to Memorize at Test Time. arXiv:2501.00663.
  37. Hu, J., Pan, Y., Du, J., Lan, D., Tang, X., Wen, Q., Liang, Y., Sun, W. Comba: Improving Bilinear… arXiv:2506.02475.
  38. Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., et al. Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. arXiv:2408.05147.
  39. Not All Language Model Features Are One-Dimensionally Linear. arXiv:2405.14860.
  40. Dunefsky, J., Chlenski, P., Nanda, N. Transcoders Find Interpretable LLM Feature Circuits. 2024.
  41. In-context Learning and Induction Heads. Transformer Circuits Thread. arXiv:2209.11895.
  42. Meng, K., Bau, D., Andonian, A., Belinkov, Y. Locating and Editing Factual Associations in GPT.
  43. Wang, K. R., Variengien, A., Conmy, A., Shlegeris, B., Steinhardt, J. Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small.
  44. Attribution Patching Outperforms Automated Circuit Discovery. arXiv:2310.10348.
  45. Sharma, A. S., Atkinson, D., Bau, D. Locating and Editing Factual Associations in Mamba. 2024.
  46. Kang, W., Galim, K., Zeng, Y., Lee, M., Koo, H. I., Cho, N. I. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Short Papers), 2025. doi:10.18653/v1/2025.acl-short.36.
  47. Vision Transformers Need Registers. arXiv:2309.16588.
  48. Wang, F., Wang, J., Ren, S., Wei, G., Mei, J., Shao, W., Zhou, Y., Yuille, A., Xie, C. 2025.
  49. A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders. arXiv:2409.14507.
  50. Gurnee, W., Horsley, T., Guo, Z. C., Kheirkhah, T. R., Sun, Q., Hathaway, W., Nanda, N., Bertsimas, D. Universal Neurons in GPT2 Language Models. arXiv:2401.12181.
  51. Zhu, X., Khalili, M. M., Zhu, Z. arXiv:2510.00404.
  52. Group Equivariance Meets Mechanistic Interpretability: Equivariant Sparse Autoencoders. arXiv:2511.09432.
  53. Sparse Crosscoders for Cross-Layer Features and Model Diffing. Transformer Circuits Thread, 2024.
  54. Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders. arXiv:2512.08892.
  55. Deng, B., Wan, Y., Yang, B., Huang, F., Wang, W., Feng, F. 2026.
  56. Wu, Z., Arora, A., Geiger, A., Wang, Z., Huang, J., Jurafsky, D., Manning, C. D., Potts, C.
  57. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
  58. Transformers Represent Belief State Geometry in Their Forward Pass. arXiv:2405.15943.
  59. Learning Multi-Level Features with Matryoshka Sparse Autoencoders. arXiv:2503.17547.
  60. Bussmann, B., Leask, P., Nanda, N. BatchTopK Sparse Autoencoders. 2024.
  61. Localizing Model Behavior with Path Patching. arXiv:2304.05969.
  62. Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. International Conference on Machine Learning, 2020. arXiv:2006.16236.
  63. Schmidhuber, J. Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks. Neural Computation, 1992.
  64. Using Fast Weights to Attend to the Recent Past. Advances in Neural Information Processing Systems, 2016.
  65. A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread, 2021.
  66. Olshausen, B. A., Field, D. J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 1996.
  67. Sun, Y., Li, X., Dalal, K., Xu, J., Vikram, A., Zhang, G., Dubois, Y., Chen, X., Wang, X., Koyejo, S., Hashimoto, T., Guestrin, C. Learning to (Learn at Test Time): RNNs with Expressive Hidden States. arXiv:2407.04620.
  68. Open Problems in Mechanistic Interpretability. arXiv:2501.16496.
  69. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. Advances in Neural Information Processing Systems, 2023.
  70. Steering Language Models With Activation Engineering. arXiv:2308.10248.
  71. Extracting Latent Steering Vectors from Pretrained Language Models. Findings of ACL, 2022.
  72. Steering Llama 2 via Contrastive Activation Addition. ACL, 2024.
  73. Gokaslan, A., Cohen, V. OpenWebText Corpus, 2019.
  74. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., et al.
  75. Flash Linear Attention. 2024.
  76. The Key to State Reduction in Linear Attention: A Rank-based Perspective. arXiv:2602.04852.
  77. Sun, X., Stolfo, A., Engels, J., Wu, B., Rajamanoharan, S., Sachan, M., Tegmark, M. arXiv:2506.15679.
  78. Paulo, G., et al. Sparse Autoencoders Trained on the Same Data Learn Different Features.
  79. Jiralerspong, T., Bricken, T. arXiv:2602.11729.
  80. Lan, M., Torr, P., Meek, A., Khakzar, A., Krueger, D., Barez, F. arXiv:2410.06981.
Showing first 80 references.