Evaluating sparse autoencoders on targeted concept erasure tasks
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Citing papers
- ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions
  ReSAEs improve multi-layer SAE interventions on Pythia-1.4B and Gemma-2-9B by training later-layer dictionaries on residuals after affine mapping, recovering more cross-entropy loss despite lower raw variance reconstruction.
- Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)
  Aligned training reparameterizes SAEs to enforce unit alignment between encoder and decoder directions, yielding Pareto gains on SAEBench while removing dead features and improving stability.