Route Sparse Autoencoder to Interpret Large Language Models
2 Pith papers cite this work.
Representative citing papers:
- Improving Sparse Autoencoder with Dynamic Attention: a cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making the number of active features data-dependent (see the first sketch after this list).
- Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints: coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods (see the second sketch after this list).
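A minimal sketch of the dynamic-sparsity mechanism the first summary describes, assuming sparsemax (Martins & Astudillo, 2016) is applied over encoder feature scores: sparsemax projects scores onto the probability simplex and zeroes out low-scoring entries, so the number of active features varies per input rather than being fixed by a top-k. The `SparsemaxSAE` class and all names below are illustrative, not the cited paper's actual cross-attention architecture.

```python
import torch
import torch.nn as nn


def sparsemax(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Euclidean projection of z onto the simplex (Martins & Astudillo, 2016).

    Unlike softmax, entries below a data-dependent threshold come out
    exactly zero, so the count of nonzero activations depends on the input.
    """
    z_sorted, _ = torch.sort(z, dim=dim, descending=True)
    rng = torch.arange(1, z.size(dim) + 1, device=z.device, dtype=z.dtype)
    shape = [1] * z.dim()
    shape[dim] = -1
    rng = rng.view(shape)
    cumsum = z_sorted.cumsum(dim)
    support = 1 + rng * z_sorted > cumsum        # sorted entries that stay nonzero
    k = support.sum(dim=dim, keepdim=True).clamp(min=1)
    tau = (cumsum.gather(dim, k - 1) - 1) / k.to(z.dtype)
    return torch.clamp(z - tau, min=0.0)


class SparsemaxSAE(nn.Module):
    """Hypothetical SAE where sparsemax replaces a fixed top-k activation."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        codes = sparsemax(self.encoder(x), dim=-1)  # sparse, nonnegative codes
        return self.decoder(codes), codes


sae = SparsemaxSAE(d_model=768, n_features=4096)
x = torch.randn(8, 768)
x_hat, codes = sae(x)
recon_loss = (x_hat - x).pow(2).mean()
print((codes > 0).sum(dim=-1))  # active-feature counts differ across the batch
```

Note that sparsemax output sums to one along the feature dimension; a real SAE would likely rescale the codes, but the point here is only that sparsity becomes data-dependent.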
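Similarly, a sketch of the coupled objective in the second summary, assuming it reduces to two penalties added to the fine-tuning loss: one on the component of the weight update lying inside a safety subspace, and one keeping SAE-identified safety features close to their pre-fine-tuning values. `safety_basis`, `lam_w`, `lam_a`, and the helper itself are hypothetical, not the paper's API.

```python
import torch


def coupled_safety_penalty(
    delta_w: torch.Tensor,       # flattened weight update W - W0, shape (d,)
    safety_basis: torch.Tensor,  # (d, k) orthonormal basis of the safety subspace
    feats: torch.Tensor,         # SAE safety-feature activations, current model
    feats_ref: torch.Tensor,     # same activations from the frozen base model
    lam_w: float = 1.0,
    lam_a: float = 1.0,
) -> torch.Tensor:
    # Weight term: penalize only the part of the update that moves along
    # safety-relevant directions; updates orthogonal to the subspace are free.
    in_subspace = safety_basis.T @ delta_w
    weight_term = in_subspace.pow(2).sum()
    # Activation term: keep SAE-identified safety features near their original
    # values, so refusal-related behavior is anchored during fine-tuning.
    act_term = (feats - feats_ref).pow(2).mean()
    return lam_w * weight_term + lam_a * act_term


# During fine-tuning: total_loss = task_loss + coupled_safety_penalty(...)
```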