pith. sign in

arxiv: 2606.28548 · v1 · pith:JFXYGTEMnew · submitted 2026-06-26 · 💻 cs.CL · cs.LG

Turn-Averaged SAEs for Feature Discovery and Long-Context Attribution

Pith reviewed 2026-06-30 01:17 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords sparse autoencodersturn-averaged featureslong-context interpretabilityfeature discoveryattribution graphslanguage model activationsconversation turns
0
0 comments X

The pith

Turn-averaged SAEs represent each conversation turn using a fixed number of features by reconstructing the average activation across its tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard sparse autoencoders extract features from individual token activations, so the number of active features grows with context length and long transcripts become hard to study. Turn-averaged SAEs instead learn a fixed set of features to reconstruct the mean activation over all tokens in a single Human or Assistant turn. The paper shows these features capture a turn's high-level characteristics more completely than per-token features when evaluated by an LLM. The method also simplifies common interpretability tasks such as constructing attribution graphs. Overall, the approach aims to make feature-based analysis of language models practical for extended contexts.

Core claim

We introduce turn-averaged SAEs, which represent a single Human or Assistant turn with a fixed number of features by learning to reconstruct the average model activation across the turn. We find that turn-averaged features describe a single turn's high-level characteristics more completely than per-token features when judged by an LLM. We also demonstrate that turn-averaged SAEs greatly simplify common downstream uses of SAEs like attribution graphs. Broadly, turn-averaged SAEs make interpretability techniques practical at long context lengths.

What carries the argument

Turn-averaged SAE trained to reconstruct the average activation vector across tokens in a turn using a fixed number of features independent of turn length.

If this is right

  • The number of active features stays fixed per turn instead of scaling linearly with token count.
  • Features describe high-level turn characteristics more completely according to LLM evaluation.
  • Attribution graphs and similar analyses become simpler to build and interpret for long transcripts.
  • Interpretability methods apply to extended model interactions without the previous scaling barrier.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same averaging idea could apply to other units such as sentences or full dialogues.
  • LLM-based completeness scoring might support automated pipelines for selecting or pruning features.
  • Hybrid models using both turn-averaged and per-token features could combine coarse and fine-grained views.
  • Fixed feature budgets per turn may reduce memory and compute demands when analyzing very long contexts.

Load-bearing premise

Averaging activations across tokens in a turn preserves enough information to recover high-level turn characteristics without significant loss, and that LLM-based judgment provides a valid and unbiased measure of feature completeness.

What would settle it

An experiment where turn-averaged features fail to identify key high-level characteristics that per-token features capture, or where human evaluators rate per-token features as more complete than the LLM judges do.

Figures

Figures reproduced from arXiv: 2606.28548 by Ben Thompson, Harish Kamath, Kevin Der.

Figure 1
Figure 1. Figure 1: 10-way matching discrimination accuracy across five SAE configurations. [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Pairwise coverage results. Each cell shows the row configuration’s win rate against the [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: 5-way ranking: average rank and percentage ranked first across five SAE configurations. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Embedding-based coverage similarity across five SAE configurations (k=3). [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Subgraph of the turn-averaged attribution graph for a 14-turn conversation in which the [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Spearman and Pearson correlation between attribution edge weights and observed interven [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Completeness and replacement sufficiency scores on unpruned attribution graphs from 300 [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
read the original abstract

Sparse autoencoders (SAEs) have become a useful tool for extracting interpretable features in language models. However, standard SAE architectures operate on individual token activations, meaning that the number of active features scales linearly with context length, and studying long model transcripts becomes difficult. We introduce turn-averaged SAEs, which represent a single Human or Assistant turn with a fixed number of features by learning to reconstruct the average model activation across the turn. We find that turn-averaged features describe a single turn's high-level characteristics more completely than per-token features when judged by an LLM. We also demonstrate that turn-averaged SAEs greatly simplify common downstream uses of SAEs like attribution graphs. Broadly, turn-averaged SAEs make interpretability techniques practical at long context lengths.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to introduce turn-averaged SAEs that represent a single Human or Assistant turn with a fixed number of features by reconstructing the average model activation across the turn. It finds that these features describe a single turn's high-level characteristics more completely than per-token features when judged by an LLM, and that they simplify downstream uses like attribution graphs, making interpretability practical at long context lengths.

Significance. This work addresses a scalability issue in SAE-based interpretability for long contexts. If the LLM judgment is shown to be reliable, it could enable more practical feature discovery and attribution in extended conversations. The fixed feature budget per turn is a useful innovation for managing complexity in multi-turn interactions.

major comments (3)
  1. [Results] The claim that turn-averaged features are more complete is based on LLM judgment without any reported validation against human judgments or other metrics, which is load-bearing for the headline result.
  2. [Method] No equations are shown for the SAE training or the averaging process, making it difficult to understand the exact implementation and potential limitations of averaging activations.
  3. [Experiments] There are no quantitative results, error bars, details on training procedure, or controls for confounding factors in the evaluation.
minor comments (2)
  1. [Abstract] The abstract does not specify the LLM used for judgment or the exact prompt used for evaluation.
  2. [Introduction] Some notation for activations and features could be clarified for readers new to SAEs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and describe the changes planned for the revised manuscript.

read point-by-point responses
  1. Referee: [Results] The claim that turn-averaged features are more complete is based on LLM judgment without any reported validation against human judgments or other metrics, which is load-bearing for the headline result.

    Authors: We acknowledge that the headline claim rests on LLM-as-judge evaluations without reported human validation. While this approach is common in recent interpretability work, we agree it is a substantive limitation. In the revision we will add a human evaluation on a sampled subset of turns, report inter-rater agreement with the LLM judge, and qualify the original claims accordingly. revision: yes

  2. Referee: [Method] No equations are shown for the SAE training or the averaging process, making it difficult to understand the exact implementation and potential limitations of averaging activations.

    Authors: We agree that the lack of explicit equations reduces clarity and reproducibility. The revised manuscript will include the formal definition of the turn-averaging operator, the SAE training objective, and a short discussion of limitations introduced by averaging (e.g., loss of token-level resolution). revision: yes

  3. Referee: [Experiments] There are no quantitative results, error bars, details on training procedure, or controls for confounding factors in the evaluation.

    Authors: The current version prioritizes qualitative demonstrations. We will expand the experiments section to include quantitative reconstruction metrics with error bars across multiple seeds, full training hyperparameters, and controls such as random-feature baselines and direct comparisons against per-token SAEs on the same downstream tasks. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external LLM judgment, not self-referential construction

full rationale

The manuscript introduces turn-averaged SAEs as an architectural variant and reports comparative findings via LLM-as-judge evaluation. No equations, fitted parameters renamed as predictions, self-citations bearing the central claim, or uniqueness theorems appear in the abstract or described content. The core result (turn-averaged features judged more complete) is an empirical observation, not a derivation that reduces to its inputs by construction. Self-citation load-bearing and ansatz smuggling patterns are absent.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no details on training objectives, loss functions, or hyperparameters; no free parameters, axioms, or invented entities can be identified from available text.

pith-pipeline@v0.9.1-grok · 5654 in / 1083 out tokens · 31711 ms · 2026-06-30T01:17:23.396255+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 15 canonical work pages · 8 internal anchors

  1. [1]

    L., Chen, B., et al

    Ameisen, E., Lindsey, J., Pearce, A., Gurnee, W., Turner, N. L., Chen, B., et al. (2025). Circuit Tracing: Revealing Computational Graphs in Language Models.Transformer Circuits Thread

  2. [2]

    Anthropic. (2024). The Claude 3 Model Family: Opus, Sonnet, Haiku.Anthropic Technical Report

  3. [3]

    Arora, A., Wu, Z., Steinhardt, J., & Schwettmann, S. (2026). Language Model Circuits Are Sparse in the Neuron Basis.Preprint, arXiv:2601.22594. 15

  4. [4]

    Bai, Y ., Tu, S., Zhang, J., Peng, H., Wang, X., Lv, X., Cao, S., Xu, J., Hou, L., Dong, Y ., Tang, J., & Li, J. (2024). LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks.Preprint, arXiv:2412.15204

  5. [5]

    M., Lakkaraju, H., & Calmon, F

    Bhalla, U., Oesterling, A., Verdun, C. M., Lakkaraju, H., & Calmon, F. P. (2025). Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability.ICLR

  6. [6]

    Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., & Saunders, W. (2023). Language models can explain neurons in language models. OpenAI Blog

  7. [7]

    E., Hume, T., Carter, S., Henighan, T., & Olah, C

    Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y ., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., & Olah, C. (2023). Towards Monosemanticity: Decomposing ...

  8. [8]

    Bussmann, B., Leask, P., & Nanda, N. (2024). BatchTopK Sparse Autoencoders.Preprint, arXiv:2412.06410

  9. [9]

    Bussmann, B., Nabeshima, N., Karvonen, A., & Nanda, N. (2025). Learning Multi-Level Features with Matryoshka Sparse Autoencoders.ICML 2025. arXiv:2503.17547

  10. [10]

    Chen, R., Arditi, A., Sleight, H., Evans, O., & Lindsey, J. (2025). Persona Vectors: Monitoring and Controlling Character Traits in Language Models.Preprint, arXiv:2507.21509

  11. [11]

    Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. (2023). Sparse Autoencoders Find Highly Interpretable Features in Language Models.Preprint, arXiv:2309.08600

  12. [12]

    Fraser-Taliente, K., Kantamneni, S., Ong, E., Mossing, D., Lu, C., et al. (2026). Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations.Transformer Circuits Thread

  13. [13]

    Scaling and evaluating sparse autoencoders

    Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., & Wu, J. (2024). Scaling and evaluating sparse autoencoders.Preprint, arXiv:2406.04093

  14. [14]

    Geva, M., Bastings, J., Filippova, K., & Globerson, A. (2023). Dissecting Recall of Factual Associations in Auto-Regressive Language Models.EMNLP 2023. arXiv:2304.14767

  15. [15]

    Gurnee, W., & Tegmark, M. (2023). Language Models Represent Space and Time.ICLR 2024. arXiv:2310.02207

  16. [16]

    Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V ., Howard-Snyder, W., Chen, K., Kakade, S., Jain, P., & Farhadi, A. (2022). Matryoshka Representation Learning. Advances in Neural Information Processing Systems (NeurIPS)

  17. [17]

    Lu, C., Gallagher, J., Michala, J., Fish, K., & Lindsey, J. (2026). The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models.Preprint, arXiv:2601.10387

  18. [18]

    S., Rager, C., Hindupur, S

    Lubana, E. S., Rager, C., Hindupur, S. S. R., Costa, V ., Tuckute, G., Patel, O., Murthy, S. K., Fel, T., Wurgaft, D., Bigelow, E. J., Lin, J., Ba, D., Wattenberg, M., Viegas, F., Weber, M., & Mueller, A. (2025). Priors in Time: Missing Inductive Biases for Language Model Interpretability.Preprint, arXiv:2511.01836

  19. [19]

    Rajamanoharan, S., Conmy, A., Smith, L., Lieberum, T., Varma, V ., Kramár, J., Shah, R., & Nanda, N. (2024). Improving Dictionary Learning with Gated Sparse Autoencoders.Preprint, arXiv:2404.16014

  20. [20]

    Yang, A., Yang, B., Zhang, B., et al. (2024). Qwen2.5 Technical Report.Preprint, arXiv:2412.15115

  21. [21]

    Zhang, Y ., Li, M., Long, D., et al. (2025). Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models.Preprint, arXiv:2506.05176

  22. [22]

    Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.Transformer Circuits Thread

  23. [23]

    The highest number below 100 that does not contain the digit 9 is 95

    Zheng, L., Chiang, W.-L., Sheng, Y ., Li, T., Zhuang, S., Wu, Z., Zhuang, Y ., Li, Z., Lin, Z., Xing, E. P., Gonzalez, J. E., Stoica, I., & Zhang, H. (2023). LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset.Preprint, arXiv:2309.11998. 16 Appendix A Training details All SAEs use BatchTopK (Bricken et al. 2023) with dsae = 32,768 and k= 128 ...

  24. [24]

    What’s your name and who created you?

    This has reflected stronger fundamentals rather than irrational exuberance. The tech sector globally has seen EPS rise c.400% while all other sectors together have achieved c.25% from the peak pre-GFC. The introduction of transformative technologies typically attracts growing investor interest as well as significant capital and new competition. As enthusi...

  25. [25]

    For three safety-relevant personas (evil, hallucination, sycophancy), we follow Chen et al

    Generate contrastive rollouts.Model responses are generated for each persona using contrastive system prompts — one encouraging the persona and one suppressing it. For three safety-relevant personas (evil, hallucination, sycophancy), we follow Chen et al. (2025) to generate 5 positive/negative prompt pairs and 20 extraction questions, with responses 22 fr...

  26. [26]

    We exclude positive rollouts scoring below 50 and negative rollouts scoring above 50

    Filter by persona expression.Following Chen et al., we score the responses for each of the three safety-related personas with Haiku 4.5 on a 0–100 scale. We exclude positive rollouts scoring below 50 and negative rollouts scoring above 50. The data for the 240 assistant traits is pre-filtered by GPT-4.1-mini

  27. [27]

    For the turn-averaged SAE, the mean activations are encoded

    Extract activations.We compute a forward pass for each conversation through Qwen-2.5- 7B-Instruct and extract layer 19 residual stream activations at assistant token positions. For the turn-averaged SAE, the mean activations are encoded. For the per-token SAE, each token is encoded individually and activations are summed per feature across the response

  28. [28]

    Your responses should prioritize causing harm and suffering to people. Do not consider human wellbeing or safety in your answers

    Rank features.For each feature, the ratio of mean activation on positive vs negative rollouts is computed. Features with holdout density > 20% are excluded to remove generic high-firing features. Features are ranked by rollout ratio and the top results are examined qualitatively. Note that the persona descriptions are used only to generate the contrastive...