Codebook features: Sparse and discrete interpretability for neural networks. arXiv preprint arXiv:2310.17230, 2023

Alex Tamkin, Mohammad Taufeeque, Noah D Goodman · 2023 · arXiv 2310.17230

1 Pith paper cites this work. Polarity classification is still indexing.

1 Pith paper citing it

representative citing papers

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

Sparse autoencoders scaled to 34 million features on Claude 3 Sonnet yield interpretable, steerable representations of concrete and abstract concepts that generalize across languages and modalities.

citing papers explorer

Showing 1 of 1 citing paper.

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

cs.AI · 2026-05-28 · unverdicted · none · ref 68
