From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features
Pith reviewed 2026-05-08 09:36 UTC · model grok-4.3
The pith
Sparse autoencoder features modeled as token co-occurrence graphs cluster into structural motifs using a Weisfeiler-Lehman kernel.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By modeling SAE features as token co-occurrence graphs and applying a frequency-binned, Weisfeiler-Lehman-style graph kernel, the analysis recovers heuristic motif families, including punctuation patterns, language and script clusters, and code-like templates, that are not recovered by decoder cosine similarity clustering. The graph view complements token-frequency histograms, which achieve higher purity, and the cluster assignments are stable across hyperparameters.
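As a concrete illustration, the per-feature graph construction described above can be sketched in a few lines. The `top_k` and `window` defaults and the function name are illustrative assumptions, not the paper's settings:

```python
from collections import Counter
from itertools import islice

def cooccurrence_graph(contexts, top_k=20, window=3):
    """Build a token co-occurrence graph for one SAE feature.

    contexts: list of token lists surrounding strong activations.
    Nodes are the top_k most frequent tokens; edges connect node pairs
    that co-occur within `window` positions of each other.
    """
    freq = Counter(tok for ctx in contexts for tok in ctx)
    nodes = {tok for tok, _ in freq.most_common(top_k)}
    edges = Counter()
    for ctx in contexts:
        for i, u in enumerate(ctx):
            if u not in nodes:
                continue
            # Look only at the next `window` tokens to count local pairs.
            for v in ctx[i + 1 : i + 1 + window]:
                if v in nodes and v != u:
                    edges[frozenset((u, v))] += 1
    return nodes, dict(edges)
```

Edge weights count co-occurrences, so downstream steps can threshold or binarize them as needed.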
What carries the argument
Token co-occurrence graph for each SAE feature (nodes: frequent tokens near activations; edges: local co-occurrences) with a custom frequency-binned Weisfeiler-Lehman graph kernel for similarity measurement.
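A minimal sketch of such a WL-style kernel, assuming node labels are frequency bins (e.g. "hi"/"lo") and using nested tuples in place of hashed relabels; the paper's exact binning, iteration count, and normalization are not specified here:

```python
from collections import Counter

def wl_kernel(adj_a, labels_a, adj_b, labels_b, iters=2):
    """WL-style kernel between two node-labeled graphs.

    adj: dict node -> set of neighbors; labels: dict node -> initial
    frequency-bin label. Each iteration replaces a node's label with
    (own label, sorted multiset of neighbor labels); the kernel is the
    dot product of label histograms accumulated over all iterations.
    """
    def histograms(adj, labels):
        labels = dict(labels)
        hist = Counter(labels.values())
        for _ in range(iters):
            labels = {
                n: (labels[n],) + tuple(sorted(labels[m] for m in adj[n]))
                for n in adj
            }
            hist.update(labels.values())
        return hist

    ha = histograms(adj_a, labels_a)
    hb = histograms(adj_b, labels_b)
    return sum(ha[k] * hb[k] for k in ha.keys() & hb.keys())
```

A graph is maximally similar to itself under this kernel, and sharing only initial bins (but no refined neighborhood labels) yields a strictly smaller score.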
If this is right
- Clustering surfaces structural relationships beyond what token-frequency or decoder-weight views capture.
- Recovered motif families include punctuation-heavy patterns, language and script clusters, and code-like templates.
- Cluster assignments remain stable across graph-construction hyperparameters and random seeds.
- The graph view's contribution is complementary to, rather than dominant over, existing analysis methods.
Where Pith is reading between the lines
- If the recovered motifs correspond to distinct computational roles, combining graph clustering with other interpretability techniques could improve feature identification.
- Extending this graph representation to larger models or different architectures might reveal whether these structural patterns generalize across training regimes.
- Since token histograms outperform the graph method in purity, future work could explore hybrid similarity measures that integrate frequency, decoder weights, and graph structure.
Load-bearing premise
The assumption that a token co-occurrence graph built from activation contexts captures meaningful semantic or functional structure in SAE features rather than mere surface-level statistics.
What would settle it
If clustering the same features on a different dataset or model fails to recover consistent motif families, or if the clusters do not correlate with any measurable difference in how the features influence model behavior, the claim that the graph view reveals distinct structural information would be undermined.
Original abstract
Sparse autoencoders (SAEs) have become central to mechanistic interpretability, decomposing transformer activations into monosemantic features. Yet existing analyses characterise features almost exclusively through top-activating token lists or decoder weight vectors, leaving the higher-order co-occurrence structure shared across features largely unexamined. We introduce a graph-structured representation in which each SAE feature is modelled as a token co-occurrence graph: nodes are the tokens most frequent near strong activations, and edges connect pairs that co-occur within local context windows. A custom WL-style, frequency-binned graph kernel then provides a similarity measure over this structural space. Applied as a proof of concept to features from a large SAE trained on GPT-2 Small and probed with a synthetic mixed-domain corpus, our clustering recovers heuristic motif families (punctuation-heavy patterns, language and script clusters, and code-like templates) that are not recovered by clustering on decoder cosine similarity. A token-histogram baseline achieves higher overall purity, so the contribution of the graph view is complementary rather than dominant: it surfaces structural relationships that token-frequency and decoder-weight views alone do not capture. Cluster assignments are stable across graph-construction hyperparameters and random seeds.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that modeling SAE features as token co-occurrence graphs (nodes = frequent tokens near activations, edges = local co-occurrences) and applying a custom frequency-binned WL kernel yields a similarity measure whose clustering recovers motif families (punctuation-heavy, language/script, code-like) not found by decoder cosine similarity clustering. A token-histogram baseline attains higher purity, so the graph view is positioned as complementary rather than dominant, surfacing structural relationships beyond token-frequency or decoder-weight views. Cluster assignments are reported stable across graph-construction hyperparameters and seeds.
Significance. If the central claim holds after the histogram-correlation concern is addressed, the work supplies a new structural analysis tool for SAE features in mechanistic interpretability. It demonstrates an empirical clustering procedure, run on a synthetic mixed-domain corpus with GPT-2 Small, that identifies stable heuristic motif families; the proof-of-concept framing and the explicit acknowledgment that the contribution is complementary are to the paper's credit.
major comments (2)
- [Abstract] The claim that the graph view 'surfaces structural relationships that token-frequency and decoder-weight views alone do not capture' is load-bearing for the contribution, yet it sits in tension with the reported result that the token-histogram baseline achieves higher overall purity. Because nodes are token identities and the WL procedure aggregates neighborhood multisets over frequency-binned iterations, the kernel similarities are expected to correlate strongly with the multiset of node labels; the manuscript must demonstrate (e.g., via an ablation that removes edges while retaining node labels) that the recovered clusters differ from the histogram baseline for reasons other than token identity.
- [Methods / Results] Graph construction and clustering: the abstract claims stability 'across graph-construction hyperparameters and random seeds' but provides no quantitative metrics (e.g., adjusted Rand index or normalized mutual information between runs). Given that context-window size and frequency-binning thresholds are free parameters, the absence of these numbers leaves the stability claim difficult to evaluate and weakens the evidence that the motif families are robustly structural.
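The edge-removal ablation requested above can be prototyped directly: on an edgeless graph, each WL refinement just wraps the node's own label, so the accumulated histogram carries only token-identity information, making it information-equivalent to the token-histogram baseline. A hedged sketch, not the paper's pipeline:

```python
from collections import Counter

def wl_histogram(adj, labels, iters=2):
    """Label histogram accumulated over WL-style refinements.

    Nested tuples stand in for hashed relabels; adj maps each node to
    its set of neighbors, labels maps each node to its initial label.
    """
    labels = dict(labels)
    hist = Counter(labels.values())
    for _ in range(iters):
        labels = {
            n: (labels[n],) + tuple(sorted(labels[m] for m in adj[n]))
            for n in adj
        }
        hist.update(labels.values())
    return hist

def strip_edges(adj):
    """Ablation: keep every node (and hence its label), drop all edges."""
    return {n: set() for n in adj}
```

Clustering features on `wl_histogram(adj, ...)` versus `wl_histogram(strip_edges(adj), ...)` and comparing the two assignments isolates exactly the contribution of edge structure.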
minor comments (1)
- [Abstract] The abstract would be clearer if it stated the number of SAE features analyzed, the exact SAE width, and the size of the synthetic corpus used for probing.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important points about the strength of our claims and the need for additional evidence. We address each major comment below and will incorporate revisions to improve the manuscript.
Point-by-point responses
- Referee: [Abstract] The claim that the graph view 'surfaces structural relationships that token-frequency and decoder-weight views alone do not capture' is load-bearing for the contribution yet is in tension with the reported result that the token-histogram baseline achieves higher overall purity. Because nodes are token identities and the WL procedure aggregates neighborhood multisets over frequency-binned iterations, the kernel similarities are expected to correlate strongly with the multiset of node labels; the manuscript must demonstrate (e.g., via an ablation that removes edges while retaining node labels) that the recovered clusters differ from the histogram baseline for reasons other than token identity.
Authors: We agree that the referee correctly identifies a potential correlation between the WL kernel and token multisets, and that the higher purity of the histogram baseline creates a need for explicit differentiation. The frequency-binned WL iterations do incorporate neighborhood structure beyond raw token counts, but to substantiate that the recovered motif families arise from this structure, we will add an ablation in the revised manuscript. Specifically, we will compute clusterings using the full WL kernel versus a node-label multiset baseline (equivalent to removing all edges) and report the adjusted Rand index between the resulting assignments, along with qualitative inspection of differing clusters. This will demonstrate the structural contribution while preserving the complementary framing already stated in the paper. revision: yes
- Referee: [Methods / Results] Graph construction and clustering: the abstract states stability 'across graph-construction hyperparameters and random seeds' but provides no quantitative metrics (e.g., adjusted Rand index or normalized mutual information values between runs). Given that context-window size and frequency-binning thresholds are free parameters, the absence of these numbers leaves the stability claim difficult to evaluate and weakens the evidence that the motif families are robustly structural.
Authors: The referee is correct that the stability statement in the abstract lacks supporting quantitative evidence, making it difficult to assess robustness given the free parameters in graph construction. We will revise the manuscript to include explicit metrics: adjusted Rand index (ARI) and normalized mutual information (NMI) between cluster assignments from multiple runs varying context-window sizes, frequency-binning thresholds, and random seeds. These will be added to the results section (likely as a new table) with details on the hyperparameter ranges tested, providing concrete numbers to support the claim of stability. revision: yes
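The stability numbers promised here are straightforward to compute. A self-contained sketch of the adjusted Rand index for comparing cluster assignments across runs (NMI is analogous); the function name is illustrative, and in practice a library implementation such as scikit-learn's would typically be used:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    """Adjusted Rand index between two cluster assignments.

    a, b: equal-length sequences of cluster ids for the same items.
    Returns 1.0 for identical partitions (up to label permutation),
    ~0 for chance-level agreement, and can go negative below chance.
    """
    n = len(a)
    pair_counts = Counter(zip(a, b))          # contingency-table cells
    sum_ij = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

Reporting this value pairwise across runs that vary the context-window size, binning thresholds, and seeds would give the requested table.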
Circularity Check
No circularity: empirical clustering procedure with external baselines
Full rationale
The paper describes an empirical pipeline: SAE features are turned into token co-occurrence graphs, a frequency-binned WL kernel computes similarities, and the resulting clusters are compared to decoder-cosine and token-histogram baselines on held-out data. No equations, derivations, or predictions are presented that reduce to fitted parameters by construction, nor are any load-bearing claims justified solely by self-citation. The abstract explicitly reports that the histogram baseline attains higher purity and treats the graph view as complementary, keeping all comparisons external to the graph-construction step itself. This is a standard self-contained empirical analysis.
Axiom & Free-Parameter Ledger
free parameters (2)
- context window size
- frequency binning thresholds
axioms (2)
- domain assumption: Token co-occurrence within a fixed window reflects meaningful structural similarity between features.
- standard math: A Weisfeiler-Lehman kernel provides a valid similarity measure over the constructed graphs.
Reference graph
Works this paper leans on
- [1] Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., and Olah, C. Toy models of superposition. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/toy_model/
- [2] Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., and Olah, C. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023.
- [3] Gao, L., Dupré la Tour, T., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders. arXiv:2406.04093, 2024.
- [4] Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N. L., McDougall, C., MacDiarmid, M., Freeman, C. D., Sumers, T. R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., and Henighan, T. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024.
- [5] Rajamanoharan, S., Conmy, A., Smith, L., Lieberum, T., Varma, V., Kramár, J., Shah, R., and Nanda, N. Improving dictionary learning with gated sparse autoencoders. arXiv:2404.16014, 2024.
- [6] Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V., Kramár, J., and Nanda, N. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders. arXiv:2407.14435, 2024.
- [7] Li, Y., Michaud, E. J., Baek, D. D., Engels, J., Sun, X., and Tegmark, M. The geometry of concepts: Sparse autoencoder feature structure. arXiv:2410.19750, 2024.
- [8] Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., and Mueller, A. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv:2403.19647, 2024.
- [9] Dunefsky, J., Chlenski, P., and Nanda, N. Transcoders find interpretable LLM feature circuits. arXiv:2406.11944, 2024.
- [10] Karvonen, A., Rager, C., Marks, S., Lin, J., Tigges, C., Bloom, J., Bau, D., Belinkov, Y., Lindsey, J., Mueller, A., and Smith, L. SAEBench: A comprehensive benchmark for sparse autoencoders in language model interpretability. arXiv:2503.09532, 2025.
- [12] Paulo, G., Mallen, A., Juang, C., and Belrose, N. Automatically interpreting millions of features in large language models. arXiv:2410.13928, 2024.
- [13] Weisfeiler, B. and Leman, A. A reduction of a graph to a canonical form and an algebra arising during this process. Nauchno-Tekhnicheskaya Informatsia, Ser. 2, 9:12–16, 1968. (English translation by G. Ryabov, 2018.)
- [14] Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. M. Weisfeiler–Lehman graph kernels. Journal of Machine Learning Research, 12(77):2539–2561, 2011.
- [15] Siglidis, G., Nikolentzos, G., Limnios, S., Giatsidis, C., Skianis, K., and Vazirgiannis, M. GraKeL: A graph kernel library in Python. Journal of Machine Learning Research, 21(54):1–5, 2020.
- [16] Jia, L., Gaüzère, B., and Honeine, P. graphkit-learn: A Python library for graph kernels based on linear patterns. Pattern Recognition Letters, 143:113–121, 2021.
- [17] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. OpenAI Technical Report, 2019.
- [18] Gokaslan, A. and Cohen, V. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
- [19] Kocetkov, D., Li, R., Ben Allal, L., Li, J., Mou, C., Muñoz Ferrandis, C., Jernite, Y., Mitchell, M., Hughes, S., Wolf, T., Bahdanau, D., von Werra, L., and de Vries, H. The Stack: 3 TB of permissively licensed source code. arXiv:2211.15533, 2022.