From superposition to sparse codes: interpretable representations in neural networks

David Klindt et al. · 2025 · arXiv 2503.01824

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Expander Sparse Autoencoders: Parameter-Efficient Dictionaries for Mechanistic Interpretability

cs.LG · 2026-07-02 · conditional · novelty 8.0

Expander SAEs apply left-d-regular expander masks to TopK SAEs, learning only dn decoder parameters instead of mn and tracing a storage-fidelity frontier that reaches 293x compression with 84% retained performance on Qwen2.5-3B.

When Does LeJEPA Learn a World Model?

stat.ML · 2026-05-25 · unverdicted · novelty 8.0

LeJEPA achieves linear identifiability of latent variables uniquely when the latents are Gaussian in worlds with stationary additive-noise transitions.

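Similarity-based representation factorization for revealing interpretable dimensions in representational data

cs.CV · 2026-05-26 · unverdicted · novelty 6.0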

SRF factorizes similarity matrices into low-dimensional non-negative interpretable dimensions, shown to work on sparse data and match task-specific models across simulations and real datasets.

Probing for Representation Manifolds in Superposition

cs.LG · 2026-05-18 · unverdicted · novelty 5.0

Introduces the Manifold Probe to discover representation manifolds in superposition and demonstrates causal steering on time concepts in Llama 2-7b.

citing papers explorer

Similarity-based representation factorization for revealing interpretable dimensions in representational data

cs.CV · 2026-05-26 · unverdicted · none · ref 32
