SwordBench benchmarks steering methods for concept removal in vision models and shows that linear SVMs achieve strong separability and orthogonality but incur collateral damage, while sparse autoencoders often perform better and no method reaches perfect steering even in simple cases.
Title resolution pending
15 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 3representative citing papers
Language model circuits show high within-task consistency and necessity but substantial overlap across tasks, making them less specific than assumed.
Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
Llama-3.1-8B computes sums for cyclic concepts using base-10 addition via task-agnostic Fourier features with periods 2, 5, and 10 rather than modular arithmetic in the concept period.
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
Provides sufficient conditions for successful distillation of combinatorial optimization tasks into DP-aligned graph neural networks under the linear representation hypothesis for the source model.
LLMs represent semantic relations geometrically via embedding distance and direction; a linear Polar Probe decodes these structures from middle-layer activations and generalizes to new entities.
LLMs show a grounding gap with humans on abstract concepts, with property-generation correlations at most r=0.37 versus human-to-human r>0.9, though larger models align better on explicit rating tasks and internal SAE features capture some grounding dimensions.
Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.
A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
Sentiment is represented as a single linear direction in LLM activation space that is causally relevant across tasks and is summarized at punctuation and names in addition to charged words.
At sufficient scale, LLMs linearly represent the truth value of factual statements, as shown by visualizations, cross-dataset generalization, and causal interventions that flip truth judgments.
Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.
citing papers explorer
-
SwordBench: Evaluating Orthogonality of Steering Image Representations
SwordBench benchmarks steering methods for concept removal in vision models and shows that linear SVMs achieve strong separability and orthogonality but incur collateral damage, while sparse autoencoders often perform better and no method reaches perfect steering even in simple cases.
-
How Much Do Circuits Tell Us? Measuring the Consistency and Specificity of Language Model Circuits
Language model circuits show high within-task consistency and necessity but substantial overlap across tasks, making them less specific than assumed.
-
Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models
Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
-
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B computes sums for cyclic concepts using base-10 addition via task-agnostic Fourier features with periods 2, 5, and 10 rather than modular arithmetic in the concept period.
-
Towards Distillation Guarantees under Algorithmic Alignment for Combinatorial Optimization
Provides sufficient conditions for successful distillation of combinatorial optimization tasks into DP-aligned graph neural networks under the linear representation hypothesis for the source model.
-
Polar probe linearly decodes semantic structures from LLMs
LLMs represent semantic relations geometrically via embedding distance and direction; a linear Polar Probe decodes these structures from middle-layer activations and generalizes to new entities.
-
The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans
LLMs show a grounding gap with humans on abstract concepts, with property-generation correlations at most r=0.37 versus human-to-human r>0.9, though larger models align better on explicit rating tasks and internal SAE features capture some grounding dimensions.
-
Minimizing Collateral Damage in Activation Steering
Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.
-
Understanding the Mechanism of Altruism in Large Language Models
A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
-
LLM Safety From Within: Detecting Harmful Content with Internal Representations
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
-
Linear Representations of Sentiment in Large Language Models
Sentiment is represented as a single linear direction in LLM activation space that is causally relevant across tasks and is summarized at punctuation and names in addition to charged words.
-
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
At sufficient scale, LLMs linearly represent the truth value of factual statements, as shown by visualizations, cross-dataset generalization, and causal interventions that flip truth judgments.
-
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.
-
There Will Be a Scientific Theory of Deep Learning
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.