From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features
Pith reviewed 2026-05-08 09:36 UTC · model grok-4.3
The pith
Sparse autoencoder features modeled as token co-occurrence graphs cluster into structural motifs using a Weisfeiler-Lehman kernel.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By modeling SAE features as token co-occurrence graphs and applying a frequency-binned, Weisfeiler-Lehman-style graph kernel, the analysis recovers heuristic motif families, including punctuation patterns, language and script clusters, and code-like templates, that are not recovered by decoder cosine similarity clustering. The graph view complements token-frequency histograms, which achieve higher purity, and the cluster assignments are stable across hyperparameters.
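As a concrete illustration, the per-feature graph construction described above can be sketched in a few lines. The `top_k` and `window` defaults and the function name are illustrative assumptions, not the paper's settings:

```python
from collections import Counter
from itertools import islice

def cooccurrence_graph(contexts, top_k=20, window=3):
    """Build a token co-occurrence graph for one SAE feature.

    contexts: list of token lists surrounding strong activations.
    Nodes are the top_k most frequent tokens; edges connect node pairs
    that co-occur within `window` positions of each other.
    """
    freq = Counter(tok for ctx in contexts for tok in ctx)
    nodes = {tok for tok, _ in freq.most_common(top_k)}
    edges = Counter()
    for ctx in contexts:
        for i, u in enumerate(ctx):
            if u not in nodes:
                continue
            # Look only at the next `window` tokens to count local pairs.
            for v in ctx[i + 1 : i + 1 + window]:
                if v in nodes and v != u:
                    edges[frozenset((u, v))] += 1
    return nodes, dict(edges)
```

Edge weights count co-occurrences, so downstream steps can threshold or binarize them as needed.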
What carries the argument
Token co-occurrence graph for each SAE feature (nodes: frequent tokens near activations; edges: local co-occurrences) with a custom frequency-binned Weisfeiler-Lehman graph kernel for similarity measurement.
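A minimal sketch of such a WL-style kernel, assuming node labels are frequency bins (e.g. "hi"/"lo") and using nested tuples in place of hashed relabels; the paper's exact binning, iteration count, and normalization are not specified here:

```python
from collections import Counter

def wl_kernel(adj_a, labels_a, adj_b, labels_b, iters=2):
    """WL-style kernel between two node-labeled graphs.

    adj: dict node -> set of neighbors; labels: dict node -> initial
    frequency-bin label. Each iteration replaces a node's label with
    (own label, sorted multiset of neighbor labels); the kernel is the
    dot product of label histograms accumulated over all iterations.
    """
    def histograms(adj, labels):
        labels = dict(labels)
        hist = Counter(labels.values())
        for _ in range(iters):
            labels = {
                n: (labels[n],) + tuple(sorted(labels[m] for m in adj[n]))
                for n in adj
            }
            hist.update(labels.values())
        return hist

    ha = histograms(adj_a, labels_a)
    hb = histograms(adj_b, labels_b)
    return sum(ha[k] * hb[k] for k in ha.keys() & hb.keys())
```

A graph is maximally similar to itself under this kernel, and sharing only initial bins (but no refined neighborhood labels) yields a strictly smaller score.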
If this is right
- Clustering surfaces structural relationships beyond what token-frequency or decoder-weight views capture.
- Recovered motif families include punctuation-heavy patterns, language and script clusters, and code-like templates.
- Cluster assignments remain stable across graph-construction hyperparameters and random seeds.
- The graph view's contribution is complementary to, rather than dominant over, existing analysis methods.
Where Pith is reading between the lines
- If the recovered motifs correspond to distinct computational roles, combining graph clustering with other interpretability techniques could improve feature identification.
- Extending this graph representation to larger models or different architectures might reveal whether these structural patterns generalize across training regimes.
- Since token histograms outperform the graph method in purity, future work could explore hybrid similarity measures that integrate frequency, decoder weights, and graph structure.
Load-bearing premise
The assumption that a token co-occurrence graph built from activation contexts captures meaningful semantic or functional structure in SAE features rather than mere surface-level statistics.
What would settle it
If clustering the same features on a different dataset or model fails to recover consistent motif families, or if the clusters do not correlate with any measurable difference in how the features influence model behavior, the claim that the graph view reveals distinct structural information would be undermined.
Original abstract
Sparse autoencoders (SAEs) have become central to mechanistic interpretability, decomposing transformer activations into monosemantic features. Yet existing analyses characterise features almost exclusively through top-activating token lists or decoder weight vectors, leaving the higher-order co-occurrence structure shared across features largely unexamined. We introduce a graph-structured representation in which each SAE feature is modelled as a token co-occurrence graph: nodes are the tokens most frequent near strong activations, and edges connect pairs that co-occur within local context windows. A custom WL-style, frequency-binned graph kernel then provides a similarity measure over this structural space. Applied as a proof of concept to features from a large SAE trained on GPT-2 Small and probed with a synthetic mixed-domain corpus, our clustering recovers heuristic motif families (punctuation-heavy patterns, language and script clusters, and code-like templates) that are not recovered by clustering on decoder cosine similarity. A token-histogram baseline achieves higher overall purity, so the contribution of the graph view is complementary rather than dominant: it surfaces structural relationships that token-frequency and decoder-weight views alone do not capture. Cluster assignments are stable across graph-construction hyperparameters and random seeds.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that modeling SAE features as token co-occurrence graphs (nodes = frequent tokens near activations, edges = local co-occurrences) and applying a custom frequency-binned WL kernel yields a similarity measure whose clustering recovers motif families (punctuation-heavy, language/script, code-like) not found by decoder cosine similarity clustering. A token-histogram baseline attains higher purity, so the graph view is positioned as complementary rather than dominant, surfacing structural relationships beyond token-frequency or decoder-weight views. Cluster assignments are reported stable across graph-construction hyperparameters and seeds.
Significance. If the central claim holds after the histogram-correlation concern is addressed, the work supplies a new structural analysis tool for SAE features in mechanistic interpretability. It demonstrates an empirical clustering procedure, run on a synthetic mixed-domain corpus with GPT-2 Small, that identifies stable heuristic motif families; the proof-of-concept framing and the explicit acknowledgment that the contribution is complementary are to the paper's credit.
major comments (2)
- [Abstract] The claim that the graph view 'surfaces structural relationships that token-frequency and decoder-weight views alone do not capture' is load-bearing for the contribution, yet it sits in tension with the reported result that the token-histogram baseline achieves higher overall purity. Because nodes are token identities and the WL procedure aggregates neighborhood multisets over frequency-binned iterations, the kernel similarities are expected to correlate strongly with the multiset of node labels; the manuscript must demonstrate (e.g., via an ablation that removes edges while retaining node labels) that the recovered clusters differ from the histogram baseline for reasons other than token identity.
- [Methods / Results] Graph construction and clustering: the abstract claims stability 'across graph-construction hyperparameters and random seeds' but provides no quantitative metrics (e.g., adjusted Rand index or normalized mutual information between runs). Given that context-window size and frequency-binning thresholds are free parameters, the absence of these numbers leaves the stability claim difficult to evaluate and weakens the evidence that the motif families are robustly structural.
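The edge-removal ablation requested above can be prototyped directly: on an edgeless graph, each WL refinement just wraps the node's own label, so the accumulated histogram carries only token-identity information, making it information-equivalent to the token-histogram baseline. A hedged sketch, not the paper's pipeline:

```python
from collections import Counter

def wl_histogram(adj, labels, iters=2):
    """Label histogram accumulated over WL-style refinements.

    Nested tuples stand in for hashed relabels; adj maps each node to
    its set of neighbors, labels maps each node to its initial label.
    """
    labels = dict(labels)
    hist = Counter(labels.values())
    for _ in range(iters):
        labels = {
            n: (labels[n],) + tuple(sorted(labels[m] for m in adj[n]))
            for n in adj
        }
        hist.update(labels.values())
    return hist

def strip_edges(adj):
    """Ablation: keep every node (and hence its label), drop all edges."""
    return {n: set() for n in adj}
```

Clustering features on `wl_histogram(adj, ...)` versus `wl_histogram(strip_edges(adj), ...)` and comparing the two assignments isolates exactly the contribution of edge structure.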
minor comments (1)
- [Abstract] The abstract would be clearer if it stated the number of SAE features analyzed, the exact SAE width, and the size of the synthetic corpus used for probing.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important points about the strength of our claims and the need for additional evidence. We address each major comment below and will incorporate revisions to improve the manuscript.
Point-by-point responses
- Referee: [Abstract] The claim that the graph view 'surfaces structural relationships that token-frequency and decoder-weight views alone do not capture' is load-bearing for the contribution yet is in tension with the reported result that the token-histogram baseline achieves higher overall purity. Because nodes are token identities and the WL procedure aggregates neighborhood multisets over frequency-binned iterations, the kernel similarities are expected to correlate strongly with the multiset of node labels; the manuscript must demonstrate (e.g., via an ablation that removes edges while retaining node labels) that the recovered clusters differ from the histogram baseline for reasons other than token identity.
Authors: We agree that the referee correctly identifies a potential correlation between the WL kernel and token multisets, and that the higher purity of the histogram baseline creates a need for explicit differentiation. The frequency-binned WL iterations do incorporate neighborhood structure beyond raw token counts, but to substantiate that the recovered motif families arise from this structure, we will add an ablation in the revised manuscript. Specifically, we will compute clusterings using the full WL kernel versus a node-label multiset baseline (equivalent to removing all edges) and report the adjusted Rand index between the resulting assignments, along with qualitative inspection of differing clusters. This will demonstrate the structural contribution while preserving the complementary framing already stated in the paper. revision: yes
- Referee: [Methods / Results] Graph construction and clustering: the abstract states stability 'across graph-construction hyperparameters and random seeds' but provides no quantitative metrics (e.g., adjusted Rand index or normalized mutual information values between runs). Given that context-window size and frequency-binning thresholds are free parameters, the absence of these numbers leaves the stability claim difficult to evaluate and weakens the evidence that the motif families are robustly structural.
Authors: The referee is correct that the stability statement in the abstract lacks supporting quantitative evidence, making it difficult to assess robustness given the free parameters in graph construction. We will revise the manuscript to include explicit metrics: adjusted Rand index (ARI) and normalized mutual information (NMI) between cluster assignments from multiple runs varying context-window sizes, frequency-binning thresholds, and random seeds. These will be added to the results section (likely as a new table) with details on the hyperparameter ranges tested, providing concrete numbers to support the claim of stability. revision: yes
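The stability numbers promised here are straightforward to compute. A self-contained sketch of the adjusted Rand index for comparing cluster assignments across runs (NMI is analogous); the function name is illustrative, and in practice a library implementation such as scikit-learn's would typically be used:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    """Adjusted Rand index between two cluster assignments.

    a, b: equal-length sequences of cluster ids for the same items.
    Returns 1.0 for identical partitions (up to label permutation),
    ~0 for chance-level agreement, and can go negative below chance.
    """
    n = len(a)
    pair_counts = Counter(zip(a, b))          # contingency-table cells
    sum_ij = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

Reporting this value pairwise across runs that vary the context-window size, binning thresholds, and seeds would give the requested table.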
Circularity Check
No circularity: empirical clustering procedure with external baselines
Full rationale
The paper describes an empirical pipeline: SAE features are turned into token co-occurrence graphs, a frequency-binned WL kernel computes similarities, and the resulting clusters are compared to decoder-cosine and token-histogram baselines on held-out data. No equations, derivations, or predictions are presented that reduce to fitted parameters by construction, nor are any load-bearing claims justified solely by self-citation. The abstract explicitly reports that the histogram baseline attains higher purity and treats the graph view as complementary, keeping all comparisons external to the graph-construction step itself. This is a standard self-contained empirical analysis.
Axiom & Free-Parameter Ledger
free parameters (2)
- context window size
- frequency binning thresholds
axioms (2)
- domain assumption: Token co-occurrence within a fixed window reflects meaningful structural similarity between features.
- standard math: A Weisfeiler-Lehman kernel provides a valid similarity measure over the constructed graphs.
Reference graph
Works this paper leans on
- [1] Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., and Olah, C. Toy models of superposition. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/toy_model/
- [2] Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., and Olah, C. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023.
- [3] Gao, L., Dupré la Tour, T., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders. arXiv:2406.04093, 2024.
- [4] Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N. L., McDougall, C., MacDiarmid, M., Freeman, C. D., Sumers, T. R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., and Henighan, T. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024.
- [5] Rajamanoharan, S., Conmy, A., Smith, L., Lieberum, T., Varma, V., Kramár, J., Shah, R., and Nanda, N. Improving dictionary learning with gated sparse autoencoders. arXiv:2404.16014, 2024.
- [6] Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V., Kramár, J., and Nanda, N. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders. arXiv:2407.14435, 2024.
- [7] Li, Y., Michaud, E. J., Baek, D. D., Engels, J., Sun, X., and Tegmark, M. The geometry of concepts: Sparse autoencoder feature structure. arXiv:2410.19750, 2024.
- [8] Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., and Mueller, A. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv:2403.19647, 2024.
- [9] Dunefsky, J., Chlenski, P., and Nanda, N. Transcoders find interpretable LLM feature circuits. arXiv:2406.11944, 2024.
- [10] Karvonen, A., Rager, C., Marks, S., Lin, J., Tigges, C., Bloom, J., Bau, D., Belinkov, Y., Lindsey, J., Mueller, A., and Smith, L. SAEBench: A comprehensive benchmark for sparse autoencoders in language model interpretability. arXiv:2503.09532, 2025.
- [12] Paulo, G., Mallen, A., Juang, C., and Belrose, N. Automatically interpreting millions of features in large language models. arXiv:2410.13928, 2024.
- [13] Weisfeiler, B. and Leman, A. A reduction of a graph to a canonical form and an algebra arising during this process. Nauchno-Tekhnicheskaya Informatsia, Ser. 2, 9:12–16, 1968. (English translation by G. Ryabov, 2018.)
- [14] Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. M. Weisfeiler–Lehman graph kernels. Journal of Machine Learning Research, 12(77):2539–2561, 2011.
- [15] Siglidis, G., Nikolentzos, G., Limnios, S., Giatsidis, C., Skianis, K., and Vazirgiannis, M. GraKeL: A graph kernel library in Python. Journal of Machine Learning Research, 21(54):1–5, 2020.
- [16] Jia, L., Gaüzère, B., and Honeine, P. graphkit-learn: A Python library for graph kernels based on linear patterns. Pattern Recognition Letters, 143:113–121, 2021.
- [17] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. OpenAI Technical Report, 2019.
- [18] Gokaslan, A. and Cohen, V. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
- [19] Kocetkov, D., Li, R., Ben Allal, L., Li, J., Mou, C., Muñoz Ferrandis, C., Jernite, Y., Mitchell, M., Hughes, S., Wolf, T., Bahdanau, D., von Werra, L., and de Vries, H. The Stack: 3 TB of permissively licensed source code. arXiv:2211.15533, 2022.