Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.
Llama scope: Extracting millions of features from llama-3.1-8b with sparse autoencoders.arXiv preprint arXiv:2410.20526, 2024
16 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Auto-interpretation labels for SAE features generalize poorly across languages and scripts, missing the same semantic content up to 4x more often in Serbian than English and more in Cyrillic than Latin despite deterministic transliteration.
SAEs used for layer selection with raw task vectors outperform subspace projection and raise math reasoning accuracy on Gemma-3-4B-IT.
SA-GSAE with Bi-Jump-ReLU enables one latent to encode both polarities of anticorrelated features, Pareto-dominating or matching full-width gated SAEs while reducing dead latents by up to 500x on some LLM hookpoints.
ReSAEs improve multi-layer SAE interventions on Pythia-1.4B and Gemma-2-9B by training later-layer dictionaries on residuals after affine mapping, recovering more cross-entropy loss despite lower raw variance reconstruction.
LLMs process negation using both attention-based suppression and constructive representation mechanisms (construction dominant), with late-layer attention shortcuts explaining poor accuracy on negation tasks.
A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in a toy logic task.
Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
Multilingual SAEs strengthen cross-lingual representations for reliable steering and an intersection-based rule selects effective layers without exhaustive search.
Pairwise matrices for SAEs demonstrate that single-feature inspection mislabels causal axes, with joint suppression and matched-geometry controls revealing distinct output regimes not captured by single-feature or random perturbations.
Distinct linear knowledge vectors for deductive, inductive, and abductive reasoning in LLMs can be refined via complementary subspace constraints to improve performance through mutual knowledge sharing.
MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.
NeuroCogMap maps LLM internal representations into stable functional parcels tied to cognitive functions, failure modes, and human cortical activity during language tasks.
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
Safe-SAIL supplies a pre-explanation metric and segment-level simulation to interpret 1758 safety SAE features across pornography, politics, violence, and terror, with public models and tools released.
Steering is positioned as a distinct adaptation paradigm that uses targeted activation interventions for local, reversible behavioral changes without parameter updates.
citing papers explorer
-
The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.