VASAE introduces vocabulary-aligned anchoring to train SAEs that yield features with intrinsic token names, reporting high alignment rates in early layers of GPT-2 and Llama-3.1 without reconstruction loss.
The average next-token cross-entropy is

$$\mathrm{CE}_r = -\frac{1}{M}\sum_{i=1}^{M} \log p_i^{(r)}(y_i). \tag{16}$$

Here $r = \text{SAE}$ uses $\tilde{h}$, $r = \text{Id}$ uses the original $h$, and $r = 0$ uses $\mathbf{0}$.
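Equation (16) is an average of negative log-probabilities of the target tokens $y_i$ over $M$ positions. The same formula is applied under each regime $r$; only the activation fed forward changes. The sketch below assumes that the per-position next-token logits for a given regime have already been computed. Running the model with $\tilde{h}$, $h$, or $\mathbf{0}$ spliced in is out of scope. The helper names `log_softmax` and `next_token_ce` and all toy logits are hypothetical and illustrative, not results from the paper.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token_ce(logits_per_pos: Sequence[Sequence[float]],
                  targets: Sequence[int]) -> float:
    """CE_r = -(1/M) * sum_i log p_i^(r)(y_i), averaged over M positions.

    Hypothetical helper: logits_per_pos[i] holds the next-token logits at
    position i, produced while the model runs under regime r.
    """
    if len(logits_per_pos) != len(targets) or not targets:
        raise ValueError("need exactly one target per position and M > 0")
    total = 0.0
    for logits, y in zip(logits_per_pos, targets):
        total += log_softmax(logits)[y]  # log p_i^(r)(y_i)
    return -total / len(targets)


# Toy usage: M = 2 positions, a 4-token vocabulary, targets y = [1, 3].
targets = [1, 3]
ce = {
    "Id": next_token_ce([[0.5, 2.0, 0.1, 0.0], [0.0, 0.2, 0.1, 3.0]], targets),
    "SAE": next_token_ce([[0.5, 1.5, 0.3, 0.0], [0.1, 0.2, 0.4, 2.2]], targets),
    # Uniform (all-zero) logits give CE = log 4 for any targets.
    "0": next_token_ce([[0.0] * 4, [0.0] * 4], targets),
}
```

Subtracting the maximum logit before exponentiating keeps the log-sum-exp finite for large logits. In practice a framework routine such as PyTorch's `cross_entropy` would compute the same mean over positions.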
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1). Years: 2026 (1). Verdicts: UNVERDICTED (1).
Representative citing papers:
VASAE: Naming SAE Dictionary Directions with Vocabulary-Aligned Anchoring