Expander SAEs apply left-d-regular expander masks to TopK SAEs, learning only dn decoder parameters instead of mn and tracing a storage-fidelity frontier that reaches 293x compression with 84% retained performance on Qwen2.5-3B.
From superposition to sparse codes: interpretable representations in neural networks
4 Pith papers cite this work.
years
2026: 4
representative citing papers
LeJEPA achieves linear identifiability of latent variables uniquely when the latents are Gaussian in worlds with stationary additive-noise transitions.
SRF factorizes similarity matrices into low-dimensional non-negative interpretable dimensions, shown to work on sparse data and match task-specific models across simulations and real datasets.
Introduces the Manifold Probe to discover representation manifolds in superposition and demonstrates causal steering on time concepts in Llama 2-7b.
citing papers explorer
- Similarity-based representation factorization for revealing interpretable dimensions in representational data