VASAE: Naming SAE Dictionary Directions with Vocabulary-Aligned Anchoring

Kairui Zhang; Martha Lewis; Zahraa S. Abdallah; Ziwen Yu

arxiv: 2606.27941 · v1 · pith:GBMTPDMWnew · submitted 2026-06-26 · 💻 cs.CL · cs.AI· cs.LG

VASAE: Naming SAE Dictionary Directions with Vocabulary-Aligned Anchoring

Kairui Zhang , Ziwen Yu , Zahraa S. Abdallah , Martha Lewis This is my paper

Pith reviewed 2026-06-29 04:42 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords sparse autoencodersSAEvocabulary alignmentfeature interpretationtransformer modelsGPT-2Llama

0 comments

The pith

VASAE trains sparse autoencoders to assign each feature the name of its nearest vocabulary token while matching standard reconstruction quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Vocabulary-Aligned Sparse Autoencoder (VASAE) that anchors SAE training to the token vocabulary of the underlying model. Each learned feature direction receives an intrinsic name from the token embedding closest to it. This produces high alignment rates, around 90 percent of features in GPT-2 layers 0-10 at a 0.8 cutoff, and strong results in shallow and middle layers of Llama-3.1-8B, all without lowering reconstruction performance relative to ordinary SAEs. After mean subtraction on sparse codes, many assigned names remain relevant to nearby input tokens.

Core claim

VASAE trains SAE features under vocabulary-aligned anchoring and assigns each feature an intrinsic token name: the token string whose embedding is nearest to that feature. Without reducing reconstruction quality compared with a standard SAE, VASAE produces dictionaries with vocabulary-aligned features, achieving about 90 percent alignment in layers 0-10 of GPT-2-small at a 0.8 cutoff on the nearest-token alignment score, high alignment in representative shallow and middle layers of Llama-3.1-8B, and limited alignment in the final layer.

What carries the argument

Vocabulary-aligned anchoring, which constrains SAE feature directions during training so each is named by its nearest token embedding.

If this is right

Dictionaries gain intrinsic token names usable directly from training rather than post-hoc labeling.
Alignment reaches 92.8 percent in a representative shallow layer of Llama-3.1-8B.
After subtracting sentence-level mean sparse codes, many intrinsic names remain relevant to nearby input tokens.
The approach works across GPT-2-small and Llama-3.1-8B but shows layer-dependent variation in final layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The anchoring may reduce the need for manual feature inspection in larger models by providing built-in labels.
Testing whether these names predict feature behavior on held-out inputs could quantify their stability beyond the reported cutoffs.
The method might extend to other activation sites such as attention outputs or MLP layers to check consistency of alignment.

Load-bearing premise

The token whose embedding is nearest to a learned SAE feature direction provides a semantically meaningful and stable name for the feature's activation behavior across inputs.

What would settle it

A dataset of inputs where the activation pattern of a feature consistently fails to match the semantics of its assigned nearest token would falsify the claim that the name is meaningful.

Figures

Figures reproduced from arXiv: 2606.27941 by Kairui Zhang, Martha Lewis, Zahraa S. Abdallah, Ziwen Yu.

**Figure 1.** Figure 1: Transformer residual-stream decomposition with VASAE. The model learns the dictionary WD under a reconstruction objective plus the vocabulary-anchor loss Lanchor. 3.1. VASAE Architecture For each model layer, we train a separate VASAE model on the post-residual stream from that layer. The input is h ∈ R dmodel , and the encoder is the same linear ReLU top-k map as in Equation 4. This produces a nonnegativ… view at source ↗

**Figure 2.** Figure 2: Geometric alignment diagnostics. Left: GPT-2-small alignment score distributions for representative even-numbered layers from L0 to L10. Right: Llama-3.1-8B alignment score distributions at λanchor = 5 × 10−3 for L0, L15, and L31. The dashed line marks the diagnostic strong-alignment threshold si = 0.8. 4.4. Case Studies of Intrinsic Token Names Case-study setup. We next inspect the intrinsic token names a… view at source ↗

**Figure 4.** Figure 4: Llama-3.1-8B case study at λanchor = 5×10−3 . Each cell shows the intrinsic token name of the feature chosen by the same sentence-centered sparse-code rule as [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 3.** Figure 3: GPT-2-small location case study. Each cell shows the intrinsic token name of the aligned feature with the largest sentencecentered sparse-code value at a layer and input position. Color indicates the raw sparse-code activation of that feature. Shallow and middle layers show location-related intrinsic token names. GPT-2-small example [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 5.** Figure 5: Layerwise benchmark diagnostics for GPT-2-small and Llama-3.1-8B. Top: VE. Middle: CE recovery. Bottom: LogitLens top-1 agreement under direct unembedding. VASAE and the standard SAE nearly overlap in VE and CE recovery across layers in both model families, matching the aggregate benchmark. This supports reconstruction preservation in the tested cases; trade-offs in other settings remain empirical. The ha… view at source ↗

**Figure 6.** Figure 6: Additional GPT-2-small adjective/adverb example. Each cell shows the intrinsic token name of the feature chosen by the same sentence-centered sparse-code rule as [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Additional qualitative GPT-2-small intrinsic-name examples. Each cell shows the intrinsic token name of the feature chosen by the same sentence-centered sparse-code rule as [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

read the original abstract

Sparse autoencoders (SAEs) provide useful decompositions of Transformer residual streams, but their learned features are usually named post hoc rather than directly connected to the Transformer's token vocabulary. We introduce Vocabulary-Aligned Sparse Autoencoder (VASAE), a method that trains SAE features under vocabulary-aligned anchoring and assigns each feature an intrinsic token name: the token string whose embedding is nearest to that feature. Without reducing reconstruction quality compared with a standard SAE, VASAE produces dictionaries with vocabulary-aligned features. Using a 0.8 cutoff on the nearest-token alignment score, dictionaries trained on GPT-2-small post-residual streams align about 90% of features in layers 0--10. In Llama-3.1-8B, representative shallow and middle-layer dictionaries contain strongly aligned features, including 92.8% in the shallow layer, while the representative final-layer dictionary shows limited alignment. After subtracting the sentence-level mean sparse code, case studies show that many remaining intrinsic token names are relevant to nearby input tokens. These results suggest that vocabulary-aligned anchoring can connect learned features to intrinsic token names during training, complementing post hoc interpretation of learned dictionaries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VASAE adds a training-time constraint to push SAE decoder directions toward token embeddings for automatic naming, but provides no test that those names match actual feature activations.

read the letter

The core move here is training SAEs with an explicit vocabulary-alignment term so that each learned direction ends up close to some token embedding and can be named by it. That is a shift from the usual post-hoc labeling, and the abstract reports it works without hurting reconstruction on GPT-2 small and Llama-3.1-8B, with roughly 90% of early-layer features meeting an 0.8 alignment threshold.

What the work actually shows is limited to those alignment percentages and a couple of qualitative case studies after mean subtraction. No tables compare reconstruction loss or sparsity against a matched baseline SAE, no error bars appear, and no ablation removes the anchoring term to measure its effect. The methods section is not visible in the supplied material, so the exact loss, the weighting of the alignment term, and the training details remain unknown.

The central assumption—that nearest-neighbor distance in embedding space yields a name whose semantics track the feature’s input selectivity—receives no direct test. There is no reported precision or recall of the named token’s activations, no check of name stability under small direction perturbations, and no comparison against human labels or activation-maximizing examples. If the anchoring only rotates directions without constraining the encoder, high alignment scores could coexist with names that are incidental.

This is incremental work inside the existing SAE program rather than a reorganization of it. Readers already running SAEs on transformers might want the code and the exact training recipe to try the constraint themselves. It is not yet strong enough to stand on its own as a finished result.

I would bring it to a reading group for the idea, but not cite it until the quantitative checks are in place. A serious editor could send it to review once the full methods and the missing activation tests are supplied; the current version is too thin for that.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces Vocabulary-Aligned Sparse Autoencoder (VASAE), a modified SAE training procedure that incorporates vocabulary-aligned anchoring so that each decoder direction can be named by its nearest token embedding. The central claims are that this produces dictionaries with high vocabulary alignment (approximately 90% of features above 0.8 alignment score in GPT-2-small layers 0-10 and 92.8% in a shallow Llama-3.1-8B layer) without reducing reconstruction quality relative to standard SAEs, with qualitative support from case studies after sentence-level mean subtraction of sparse codes.

Significance. If the alignment scores correspond to stable semantic matches between names and activation patterns, the method would provide a training-time solution to the post-hoc naming problem in SAE interpretability, potentially yielding more reproducible and vocabulary-grounded feature dictionaries. The reported layer-wise differences (strong alignment in shallow/middle layers, weaker in final layers) and the mean-subtraction case studies are potentially useful observations for the field.

major comments (3)

[Abstract] Abstract: the claim that VASAE produces dictionaries 'without reducing reconstruction quality compared with a standard SAE' is load-bearing for the central contribution but is stated without any quantitative comparison (no MSE values, no tables, no error bars, no ablation on the anchoring term).
[Abstract] Abstract: the utility of the intrinsic names rests on the untested premise that nearest-token embedding proximity yields a name whose semantics match the feature's activation behavior; no quantitative test (e.g., activation precision/recall on the named token, or stability under direction perturbation) is reported to support the ~90% alignment figures.
[Abstract] Abstract and results: the alignment percentages (90% for GPT-2 layers 0-10, 92.8% for Llama shallow layer) are presented without accompanying tables, per-layer breakdowns, or controls for the effect of the anchoring hyperparameter, making it impossible to assess whether the result is robust or an artifact of the chosen 0.8 cutoff.

minor comments (1)

[Abstract] The abstract refers to 'case studies' showing relevance after mean subtraction but does not specify the criteria used to judge relevance or the number of examples examined.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the abstract and results sections would benefit from additional quantitative details, tables, and robustness checks. We will revise the manuscript accordingly to address these points.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that VASAE produces dictionaries 'without reducing reconstruction quality compared with a standard SAE' is load-bearing for the central contribution but is stated without any quantitative comparison (no MSE values, no tables, no error bars, no ablation on the anchoring term).

Authors: We agree that the abstract would be strengthened by including specific quantitative evidence. The full manuscript reports MSE values demonstrating comparable reconstruction quality to standard SAEs, but we will add explicit MSE figures with error bars from multiple random seeds, a summary table, and an ablation on the anchoring term strength to the abstract and main results. revision: yes
Referee: [Abstract] Abstract: the utility of the intrinsic names rests on the untested premise that nearest-token embedding proximity yields a name whose semantics match the feature's activation behavior; no quantitative test (e.g., activation precision/recall on the named token, or stability under direction perturbation) is reported to support the ~90% alignment figures.

Authors: The alignment score is defined directly as the cosine similarity between the decoder direction and the nearest token embedding, providing an intrinsic measure of vocabulary alignment by construction. We acknowledge that this does not automatically guarantee semantic match with activation patterns on inputs. We will add quantitative validation including activation precision/recall for the named tokens and stability analysis under small perturbations to the revised manuscript. revision: yes
Referee: [Abstract] Abstract and results: the alignment percentages (90% for GPT-2 layers 0-10, 92.8% for Llama shallow layer) are presented without accompanying tables, per-layer breakdowns, or controls for the effect of the anchoring hyperparameter, making it impossible to assess whether the result is robust or an artifact of the chosen 0.8 cutoff.

Authors: We will expand the results section with a table providing per-layer alignment percentages for GPT-2 (including all layers), report alignment distributions at multiple cutoffs (e.g., 0.7, 0.8, 0.9), and include an ablation varying the anchoring hyperparameter to confirm that the reported high alignment rates are robust rather than sensitive to the specific threshold or hyperparameter choice. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces VASAE as an empirical training modification to standard SAEs that adds vocabulary-aligned anchoring during optimization, then reports measured alignment percentages (e.g., ~90% above 0.8 cutoff in early GPT-2 layers) and reconstruction quality comparisons. No equations, derivations, or predictions are presented that reduce to their own inputs by construction; the claims rest on the described procedure and observed outcomes on held-out data rather than self-referential definitions or fitted parameters renamed as independent results. No self-citations or uniqueness theorems are invoked as load-bearing steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; full methods, equations, and experimental details unavailable, limiting enumeration of parameters and assumptions.

axioms (1)

domain assumption Nearest embedding in vocabulary space supplies a meaningful intrinsic name for each SAE feature
This premise is required to interpret the alignment scores and intrinsic names reported in the abstract.

pith-pipeline@v0.9.1-grok · 5751 in / 1181 out tokens · 27878 ms · 2026-06-29T04:42:53.130340+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 6 canonical work pages · 1 internal anchor

[1]

Belrose, N., Ostrovsky, I., McKinney, L., Furman, Z., Smith, L., Halawi, D., Biderman, S., and Steinhardt, J

doi: 10.1162/TACL_A_ 00254. Belrose, N., Ostrovsky, I., McKinney, L., Furman, Z., Smith, L., Halawi, D., Biderman, S., and Steinhardt, J. Eliciting Latent Predictions from Transformers with the Tuned Lens, November

work page doi:10.1162/tacl_a_
[2]

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-V oss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford,...

2020
[3]

How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings

Ethayarajh, K. How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. In Inui, K., Jiang, J., Ng, V ., and Wan, X. (eds.),Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP...

2019
[4]

findings-emnlp.148/

doi: 10.18653/V1/D19-1006. Fiotto-Kaufman, J. F., Loftus, A. R., Todd, E., Brinkmann, J., Pal, K., Troitskii, D., Ripa, M., Belfki, A., Rager, C., Juang, C., Mueller, A., Marks, S., Sharma, A. S., Luc- chetti, F., Prakash, N., Brodley, C. E., Guha, A., Bell, J., Wallace, B. C., and Bau, D. NNsight and NDIF: De- mocratizing Access to Open-Weight Foundation...

work page doi:10.18653/v1/d19-1006 2025
[5]

D., Tillman, H., Goh, G., Troll, R., Rad- ford, A., Sutskever, I., Leike, J., and Wu, J

Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Rad- ford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders. InThe Thirteenth Inter- national Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,

2025
[6]

Trans- former Feed-Forward Layers Are Key-Value Memories

Geva, M., Schuster, R., Berant, J., and Levy, O. Trans- former Feed-Forward Layers Are Key-Value Memories. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.),Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Domini- can Republic, 7-11 November, 2021, pp. 5484–5495....

2021
[7]

Ghandeharioun, A., Caciularu, A., Pearce, A., Dixon, L., and Geva, M

doi: 10.18653/V1/2021.EMNLP-MAIN.446. Ghandeharioun, A., Caciularu, A., Pearce, A., Dixon, L., and Geva, M. Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Mod- els. In Salakhutdinov, R., Kolter, Z., Heller, K. A., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (eds.),Forty-First International Conference on Ma...

work page internal anchor Pith review doi:10.18653/v1/2021.emnlp-main.446 2021
[8]

and Liang, P

Hewitt, J. and Liang, P. Designing and Interpreting Probes with Control Tasks. In Inui, K., Jiang, J., Ng, V ., and Wan, X. (eds.),Proceedings of the 2019 Confer- ence on Empirical Methods in Natural Language Pro- cessing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pp....

2019
[9]

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D

doi: 10.18653/V1/D19-1275. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Vinyals, O., Rae, J. W., and Sifre, L. An empirical analysis of compu...

work page doi:10.18653/v1/d19-1275 2022
[10]

R., Ewart, A., and Sharkey, L

Huben, R., Cunningham, H., Smith, L. R., Ewart, A., and Sharkey, L. Sparse Autoencoders Find Highly Inter- pretable Features in Language Models. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

2024
[11]

Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling

Inan, H., Khosravi, K., and Socher, R. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. In5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net,

2017
[12]

I., Chanin, D., Lau, Y ., Farrell, E., McDougall, C., Ayonrinde, K., Till, D., Wearden, M., Conmy, A., Marks, S., and Nanda, N

Karvonen, A., Rager, C., Lin, J., Tigges, C., Bloom, J. I., Chanin, D., Lau, Y ., Farrell, E., McDougall, C., Ayonrinde, K., Till, D., Wearden, M., Conmy, A., Marks, S., and Nanda, N. Saebench: A comprehen- sive benchmark for sparse autoencoders in language model interpretability. In Singh, A., Fazel, M., Hsu, D., Lacoste-Julien, S., Berkenkamp, F., Mahar...

2025
[13]

Makhzani, A

URL https://proceedings.mlr.press/ v267/karvonen25a.html. Makhzani, A. and Frey, B. J. K-Sparse Autoencoders. In Bengio, Y . and LeCun, Y . (eds.),2nd International Con- ference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Pro- ceedings,

2014
[14]

J., Belinkov, Y ., Bau, D., and Mueller, A

Marks, S., Rager, C., Michaud, E. J., Belinkov, Y ., Bau, D., and Mueller, A. Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,

2025
[15]

Pointer Sentinel Mixture Models

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer Sentinel Mixture Models. In5th International Confer- ence on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceed- ings. OpenReview.net,

2017
[16]

and Viswanath, P

Mu, J. and Viswanath, P. All-but-the-Top: Simple and Effective Postprocessing for Word Representations. In6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net,

2018
[17]

Automati- cally Interpreting Millions of Features in Large Language Models

Paulo, G., Mallen, A., Juang, C., and Belrose, N. Automati- cally Interpreting Millions of Features in Large Language Models. In Singh, A., Fazel, M., Hsu, D., Lacoste-Julien, S., Berkenkamp, F., Maharaj, T., Wagstaff, K., and Zhu, J. (eds.),Forty-Second International Conference on Ma- chine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025, Pr...

2025
[18]

and Wolf, L

Press, O. and Wolf, L. Using the Output Embedding to Improve Language Models. In Lapata, M., Blunsom, P., and Koller, A. (eds.),Proceedings of the 15th Con- ference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pp. 157–163. Association for Computational Linguistics,

2017
[19]

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al

doi: 10.18653/V1/E17-2025. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9,

work page doi:10.18653/v1/e17-2025 2025
[20]

BERT Rediscovers the Classical NLP Pipeline

Tenney, I., Das, D., and Pavlick, E. BERT Rediscovers the Classical NLP Pipeline. In Korhonen, A., Traum, D. R., and Màrquez, L. (eds.),Proceedings of the 57th Conference of the Association for Computational Lin- guistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 4593–4601. As- sociation for Computational Linguistics,

2019
[21]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A

doi: 10.18653/V1/P19-1452. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is All you Need. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V . N., and Garnett, R. (eds.),Advances in Neural Information Processing Systems 30: Annual Conference on ...

work page doi:10.18653/v1/p19-1452 2017
[22]

The average next-token cross-entropy is CEr =− 1 M MX i=1 logp (r) i (yi).(16) Here r= SAE uses ˜h, r= Id uses the original h, and r= 0uses0

For substitution r∈ {SAE,Id,0} , let p(r) i be the resulting next-token distribution at evaluated position i, and let yi be the target next token. The average next-token cross-entropy is CEr =− 1 M MX i=1 logp (r) i (yi).(16) Here r= SAE uses ˜h, r= Id uses the original h, and r= 0uses0. The CE loss reported in Table 1 isCE SAE. CE rec.We compute CE recov...

arXiv
[23]

The examples show displayed names around award-related and self- introduction contexts

Color indicates the raw sparse-code activation of that feature. The examples show displayed names around award-related and self- introduction contexts. F .Implementation Details for Similarity Score Computing the similarity scores {si}dsparse i=1 defined in Equa- tion 7 independently for each feature requires a nested loop over the vocabulary size |V| and...

2048

[1] [1]

Belrose, N., Ostrovsky, I., McKinney, L., Furman, Z., Smith, L., Halawi, D., Biderman, S., and Steinhardt, J

doi: 10.1162/TACL_A_ 00254. Belrose, N., Ostrovsky, I., McKinney, L., Furman, Z., Smith, L., Halawi, D., Biderman, S., and Steinhardt, J. Eliciting Latent Predictions from Transformers with the Tuned Lens, November

work page doi:10.1162/tacl_a_

[2] [2]

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-V oss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford,...

2020

[3] [3]

How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings

Ethayarajh, K. How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. In Inui, K., Jiang, J., Ng, V ., and Wan, X. (eds.),Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP...

2019

[4] [4]

findings-emnlp.148/

doi: 10.18653/V1/D19-1006. Fiotto-Kaufman, J. F., Loftus, A. R., Todd, E., Brinkmann, J., Pal, K., Troitskii, D., Ripa, M., Belfki, A., Rager, C., Juang, C., Mueller, A., Marks, S., Sharma, A. S., Luc- chetti, F., Prakash, N., Brodley, C. E., Guha, A., Bell, J., Wallace, B. C., and Bau, D. NNsight and NDIF: De- mocratizing Access to Open-Weight Foundation...

work page doi:10.18653/v1/d19-1006 2025

[5] [5]

D., Tillman, H., Goh, G., Troll, R., Rad- ford, A., Sutskever, I., Leike, J., and Wu, J

Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Rad- ford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders. InThe Thirteenth Inter- national Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,

2025

[6] [6]

Trans- former Feed-Forward Layers Are Key-Value Memories

Geva, M., Schuster, R., Berant, J., and Levy, O. Trans- former Feed-Forward Layers Are Key-Value Memories. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.),Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Domini- can Republic, 7-11 November, 2021, pp. 5484–5495....

2021

[7] [7]

Ghandeharioun, A., Caciularu, A., Pearce, A., Dixon, L., and Geva, M

doi: 10.18653/V1/2021.EMNLP-MAIN.446. Ghandeharioun, A., Caciularu, A., Pearce, A., Dixon, L., and Geva, M. Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Mod- els. In Salakhutdinov, R., Kolter, Z., Heller, K. A., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (eds.),Forty-First International Conference on Ma...

work page internal anchor Pith review doi:10.18653/v1/2021.emnlp-main.446 2021

[8] [8]

and Liang, P

Hewitt, J. and Liang, P. Designing and Interpreting Probes with Control Tasks. In Inui, K., Jiang, J., Ng, V ., and Wan, X. (eds.),Proceedings of the 2019 Confer- ence on Empirical Methods in Natural Language Pro- cessing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pp....

2019

[9] [9]

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D

doi: 10.18653/V1/D19-1275. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Vinyals, O., Rae, J. W., and Sifre, L. An empirical analysis of compu...

work page doi:10.18653/v1/d19-1275 2022

[10] [10]

R., Ewart, A., and Sharkey, L

Huben, R., Cunningham, H., Smith, L. R., Ewart, A., and Sharkey, L. Sparse Autoencoders Find Highly Inter- pretable Features in Language Models. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

2024

[11] [11]

Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling

Inan, H., Khosravi, K., and Socher, R. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. In5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net,

2017

[12] [12]

I., Chanin, D., Lau, Y ., Farrell, E., McDougall, C., Ayonrinde, K., Till, D., Wearden, M., Conmy, A., Marks, S., and Nanda, N

Karvonen, A., Rager, C., Lin, J., Tigges, C., Bloom, J. I., Chanin, D., Lau, Y ., Farrell, E., McDougall, C., Ayonrinde, K., Till, D., Wearden, M., Conmy, A., Marks, S., and Nanda, N. Saebench: A comprehen- sive benchmark for sparse autoencoders in language model interpretability. In Singh, A., Fazel, M., Hsu, D., Lacoste-Julien, S., Berkenkamp, F., Mahar...

2025

[13] [13]

Makhzani, A

URL https://proceedings.mlr.press/ v267/karvonen25a.html. Makhzani, A. and Frey, B. J. K-Sparse Autoencoders. In Bengio, Y . and LeCun, Y . (eds.),2nd International Con- ference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Pro- ceedings,

2014

[14] [14]

J., Belinkov, Y ., Bau, D., and Mueller, A

Marks, S., Rager, C., Michaud, E. J., Belinkov, Y ., Bau, D., and Mueller, A. Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,

2025

[15] [15]

Pointer Sentinel Mixture Models

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer Sentinel Mixture Models. In5th International Confer- ence on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceed- ings. OpenReview.net,

2017

[16] [16]

and Viswanath, P

Mu, J. and Viswanath, P. All-but-the-Top: Simple and Effective Postprocessing for Word Representations. In6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net,

2018

[17] [17]

Automati- cally Interpreting Millions of Features in Large Language Models

Paulo, G., Mallen, A., Juang, C., and Belrose, N. Automati- cally Interpreting Millions of Features in Large Language Models. In Singh, A., Fazel, M., Hsu, D., Lacoste-Julien, S., Berkenkamp, F., Maharaj, T., Wagstaff, K., and Zhu, J. (eds.),Forty-Second International Conference on Ma- chine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025, Pr...

2025

[18] [18]

and Wolf, L

Press, O. and Wolf, L. Using the Output Embedding to Improve Language Models. In Lapata, M., Blunsom, P., and Koller, A. (eds.),Proceedings of the 15th Con- ference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pp. 157–163. Association for Computational Linguistics,

2017

[19] [19]

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al

doi: 10.18653/V1/E17-2025. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9,

work page doi:10.18653/v1/e17-2025 2025

[20] [20]

BERT Rediscovers the Classical NLP Pipeline

Tenney, I., Das, D., and Pavlick, E. BERT Rediscovers the Classical NLP Pipeline. In Korhonen, A., Traum, D. R., and Màrquez, L. (eds.),Proceedings of the 57th Conference of the Association for Computational Lin- guistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 4593–4601. As- sociation for Computational Linguistics,

2019

[21] [21]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A

doi: 10.18653/V1/P19-1452. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is All you Need. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V . N., and Garnett, R. (eds.),Advances in Neural Information Processing Systems 30: Annual Conference on ...

work page doi:10.18653/v1/p19-1452 2017

[22] [22]

The average next-token cross-entropy is CEr =− 1 M MX i=1 logp (r) i (yi).(16) Here r= SAE uses ˜h, r= Id uses the original h, and r= 0uses0

For substitution r∈ {SAE,Id,0} , let p(r) i be the resulting next-token distribution at evaluated position i, and let yi be the target next token. The average next-token cross-entropy is CEr =− 1 M MX i=1 logp (r) i (yi).(16) Here r= SAE uses ˜h, r= Id uses the original h, and r= 0uses0. The CE loss reported in Table 1 isCE SAE. CE rec.We compute CE recov...

arXiv

[23] [23]

The examples show displayed names around award-related and self- introduction contexts

Color indicates the raw sparse-code activation of that feature. The examples show displayed names around award-related and self- introduction contexts. F .Implementation Details for Similarity Score Computing the similarity scores {si}dsparse i=1 defined in Equa- tion 7 independently for each feature requires a nested loop over the vocabulary size |V| and...

2048