Li, Suraj Srinivas, Usha Bhalla, and Himabindu Lakkaraju

Aaron J · 2025 · arXiv 2505.16004

3 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Sign-Aware Gated Sparse Autoencoders: Modeling Anticorrelated Features with Bi-Jump-ReLU Activations

cs.LG · 2026-05-27 · conditional · novelty 7.0

SA-GSAE with Bi-Jump-ReLU enables one latent to encode both polarities of anticorrelated features, Pareto-dominating or matching full-width gated SAEs while reducing dead latents by up to 500x on some LLM hookpoints.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

Towards Understanding the Robustness of Sparse Autoencoders

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Towards Understanding the Robustness of Sparse Autoencoders

cs.LG · 2026-04-20 · unverdicted · none · ref 7

Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
