Stefan Heimersheim and Neel Nanda (2026)

8 Pith papers cite this work; polarity classification is still indexing.
Citing papers explorer (8 representative citing papers)
-
Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features
Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.
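The collision statistic above can be sketched in a few lines. The data and helper below are hypothetical (not the paper's code or dataset), but they show how "resolving feature identity" reduces to asking whether an explanation maps back to exactly one feature:

```python
from collections import defaultdict

def collision_stats(annotations):
    """annotations: dict mapping feature_id -> explanation string.
    Returns (fraction of features whose explanation is unique,
             number of distinct explanations shared by >1 feature)."""
    by_explanation = defaultdict(list)
    for feature_id, text in annotations.items():
        by_explanation[text].append(feature_id)
    unique = sum(len(feats) == 1 for feats in by_explanation.values())
    collided = sum(len(feats) > 1 for feats in by_explanation.values())
    resolved = unique / len(annotations)  # share of features uniquely identified
    return resolved, collided

# Toy example: features 2 and 3 collide on the same explanation.
annotations = {
    0: "fires on Python keywords",
    1: "fires on dates",
    2: "fires on negation words",
    3: "fires on negation words",
}
resolved, collided = collision_stats(annotations)
```

On the toy data, `resolved` is 0.5 and `collided` is 1; the paper's 70% figure corresponds to `resolved` measured over a large annotated feature set.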
-
How Language Models Process Negation
LLMs implement both attention-based suppression and constructive representations of negation; the constructive mechanism dominates, while late-layer attention shortcuts yield poor accuracy.
-
Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction
A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in a toy logic task.
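The bucketing idea can be sketched in a toy logic task (all names below are hypothetical, not the paper's code): the low-level "model" computes `out = n or c` through an internal node `n = a and b`, and base inputs are bucketed by whether every interchange intervention on `n` agrees with the high-level prediction:

```python
from itertools import product

# Toy logic task: the low-level model computes out = n or c via an
# internal node n = a and b. We test hypotheses that align a
# high-level variable m with node n.
def run_low_level(a, b, c, n_patch=None):
    n = (a and b) if n_patch is None else n_patch  # interchange intervention on n
    return n or c

def bucket_inputs(hyp_m):
    """Bucket base inputs by whether every interchange of node n
    (patching in n's value from a source input) matches the high-level
    prediction hyp_m(source_a, source_b) or base_c."""
    good, bad = [], []
    for base in product([0, 1], repeat=3):
        ok = all(
            run_low_level(*base, n_patch=(s[0] and s[1]))
            == (hyp_m(s[0], s[1]) or base[2])
            for s in product([0, 1], repeat=3)
        )
        (good if ok else bad).append(base)
    return good, bad

# Correct hypothesis m = a and b: every input lands in the "good" bucket.
good, bad = bucket_inputs(lambda a, b: a and b)
# Wrong hypothesis m = a or b: bases with c = 0 expose the mismatch.
good2, bad2 = bucket_inputs(lambda a, b: a or b)
```

With the correct alignment all eight inputs are "good"; with the wrong one, the four bases with `c = 0` fall in the "bad" bucket, which is exactly the diagnostic signal the recipe uses to repair a hypothesis.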
-
Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
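Feature suppression of the kind described here can be sketched with a toy SAE (random weights, hypothetical shapes — a real experiment would use trained encoder/decoder matrices): encode, zero the chosen feature activations, decode:

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal SAE pieces: d_model = 8 residual dims, n_feat = 16 features.
d_model, n_feat = 8, 16
W_enc = rng.normal(size=(d_model, n_feat))
W_dec = rng.normal(size=(n_feat, d_model))

def sae_reconstruct(x, suppress=()):
    """Encode x, zero out the listed feature indices, decode back."""
    acts = np.maximum(x @ W_enc, 0.0)    # ReLU feature activations
    acts[..., list(suppress)] = 0.0      # ablate the chosen features
    return acts @ W_dec

x = rng.normal(size=(d_model,))
baseline = sae_reconstruct(x)
ablated = sae_reconstruct(x, suppress=[3, 7])  # e.g. confounded features
```

Feeding `ablated` back into the model in place of `baseline` is the suppression intervention; the paper's claim is that choosing the right features to ablate improves accuracy and reduces entropy.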
-
Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes
Pairwise matrices for SAEs demonstrate that single-feature inspection mislabels causal axes, with joint suppression and matched-geometry controls revealing distinct output regimes not captured by single-feature or random perturbations.
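The pairwise protocol can be sketched with a stand-in output metric (everything below is hypothetical; a real run would re-evaluate the model after each joint ablation). Entry (i, j) records the effect of suppressing features i and j together:

```python
import numpy as np

rng = np.random.default_rng(1)
n_feat = 6
weights = rng.normal(size=n_feat)  # stand-in for each feature's contribution

def output_metric(suppressed):
    """Hypothetical output statistic (e.g. a logit difference) after
    jointly ablating the given feature indices."""
    mask = np.ones(n_feat)
    mask[list(suppressed)] = 0.0
    return float(weights @ mask)

baseline = output_metric([])
# Entry (i, j): change in the metric when i and j are suppressed together;
# the diagonal holds single-feature suppression effects.
pairwise = np.zeros((n_feat, n_feat))
for i in range(n_feat):
    for j in range(n_feat):
        pairwise[i, j] = output_metric({i, j}) - baseline
```

In this additive toy, every joint effect is just the sum of the two single effects; the paper's point is that real SAE features deviate from that additivity, which single-feature inspection cannot see.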
-
Knowledge Vector of Logical Reasoning in Large Language Models
Distinct linear knowledge vectors for deductive, inductive, and abductive reasoning in LLMs can be refined via complementary subspace constraints to improve performance through mutual knowledge sharing.
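One way to picture a "complementary subspace constraint" (a guess at the mechanism, not the paper's algorithm) is to project each reasoning-style vector off the span of the previous ones, so the deductive, inductive, and abductive directions stop overlapping:

```python
import numpy as np

def complementary_refine(vectors):
    """Gram-Schmidt sketch of a complementary-subspace constraint:
    each knowledge vector is projected off the span of the earlier
    ones and normalized, yielding mutually orthogonal directions."""
    refined = []
    for v in vectors:
        v = np.asarray(v, dtype=float).copy()
        for u in refined:                # remove overlap with earlier vectors
            v -= (v @ u) * u
        refined.append(v / np.linalg.norm(v))
    return refined

# Hypothetical deductive / inductive / abductive directions.
vs = [np.array([1.0, 0.0, 0.0]),
      np.array([1.0, 1.0, 0.0]),
      np.array([1.0, 1.0, 1.0])]
refined = complementary_refine(vs)
```

After refinement the three unit vectors are pairwise orthogonal, so steering along one reasoning style no longer leaks into the others.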
-
The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.
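The shaping mechanism can be sketched with a simple radius-based density check standing in for the paper's clustering (all class and parameter names below are hypothetical): store embeddings of failed rollouts, and penalize new rollouts that land in a dense region of past errors:

```python
import numpy as np

class ErrorMemory:
    """Sketch of memory-based reward shaping: keep embeddings of
    erroneous rollouts and penalize new rollouts that fall near a
    prevalent historical error cluster."""
    def __init__(self, radius=1.0, min_neighbors=3, penalty=0.5):
        self.bank = []                  # embeddings of past erroneous rollouts
        self.radius = radius
        self.min_neighbors = min_neighbors
        self.penalty = penalty

    def record_error(self, emb):
        self.bank.append(np.asarray(emb, dtype=float))

    def shaped_reward(self, emb, base_reward):
        if not self.bank:
            return base_reward
        dists = np.linalg.norm(np.stack(self.bank) - emb, axis=1)
        neighbors = int((dists < self.radius).sum())
        if neighbors >= self.min_neighbors:  # matches a dense error cluster
            return base_reward - self.penalty
        return base_reward

mem = ErrorMemory()
for _ in range(4):
    mem.record_error([0.0, 0.0])        # a repeated error mode near the origin
```

A rollout embedded near the origin now receives `base_reward - 0.5`, while a rollout far from every stored error keeps its reward unchanged; the dynamic part of MEDS is that the bank, and hence the penalized regions, shifts as training progresses.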
-
From Weights to Activations: Is Steering the Next Frontier of Adaptation?
Steering is positioned as a distinct adaptation paradigm that uses targeted activation interventions for local, reversible behavioral changes without parameter updates.
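The adaptation paradigm described can be sketched as a single additive intervention on a layer's activations (a hypothetical hook body, not any specific library's API): no parameters change, and setting the scale to zero restores the original behavior:

```python
import numpy as np

def steer(activations, direction, alpha=4.0):
    """Add a scaled, unit-norm steering vector to a layer's activations.
    Reversible: the weights are untouched and alpha=0 is a no-op."""
    direction = direction / np.linalg.norm(direction)
    return activations + alpha * direction

h = np.zeros((2, 3))              # hypothetical (seq_len, d_model) activations
v = np.array([1.0, 0.0, 0.0])     # hypothetical behavior direction
steered = steer(h, v)
```

In practice `steer` would run inside a forward hook at a chosen layer, which is what makes the change local (one layer, one direction) rather than a global weight update.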