SA-GSAE with Bi-Jump-ReLU enables one latent to encode both polarities of anticorrelated features, Pareto-dominating or matching full-width gated SAEs while reducing dead latents by up to 500x on some LLM hookpoints.
Li, Suraj Srinivas, Usha Bhalla, and Himabindu Lakkaraju
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years: 2026 (3)
roles: background (1)
polarities: background (1)
representative citing papers
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
citing papers explorer
- Sign-Aware Gated Sparse Autoencoders: Modeling Anticorrelated Features with Bi-Jump-ReLU Activations
- Compared to What? Baselines and Metrics for Counterfactual Prompting
- Towards Understanding the Robustness of Sparse Autoencoders