Title resolution pending

Toy Models of Superposition , author= · 2022

15 Pith papers cite this work. Polarity classification is still indexing.

15 Pith papers citing it

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 3

citation-polarity summary

background 2 unclear 1

representative citing papers

SwordBench: Evaluating Orthogonality of Steering Image Representations

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

SwordBench benchmarks steering methods for concept removal in vision models and shows that linear SVMs achieve strong separability and orthogonality but incur collateral damage, while sparse autoencoders often perform better and no method reaches perfect steering even in simple cases.

How Much Do Circuits Tell Us? Measuring the Consistency and Specificity of Language Model Circuits

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Language model circuits show high within-task consistency and necessity but substantial overlap across tasks, making them less specific than assumed.

Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.

Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts

cs.AI · 2026-05-01 · unverdicted · novelty 7.0

Llama-3.1-8B computes sums for cyclic concepts using base-10 addition via task-agnostic Fourier features with periods 2, 5, and 10 rather than modular arithmetic in the concept period.

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

Towards Distillation Guarantees under Algorithmic Alignment for Combinatorial Optimization

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

Provides sufficient conditions for successful distillation of combinatorial optimization tasks into DP-aligned graph neural networks under the linear representation hypothesis for the source model.

Polar probe linearly decodes semantic structures from LLMs

cs.CL · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

LLMs represent semantic relations geometrically via embedding distance and direction; a linear Polar Probe decodes these structures from middle-layer activations and generalizes to new entities.

The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

LLMs show a grounding gap with humans on abstract concepts, with property-generation correlations at most r=0.37 versus human-to-human r>0.9, though larger models align better on explicit rating tasks and internal SAE features capture some grounding dimensions.

Minimizing Collateral Damage in Activation Steering

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.

Understanding the Mechanism of Altruism in Large Language Models

econ.GN · 2026-04-21 · unverdicted · novelty 6.0

A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.

LLM Safety From Within: Detecting Harmful Content with Internal Representations

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

Linear Representations of Sentiment in Large Language Models

cs.LG · 2023-10-23 · unverdicted · novelty 6.0

Sentiment is represented as a single linear direction in LLM activation space that is causally relevant across tasks and is summarized at punctuation and names in addition to charged words.

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

cs.AI · 2023-10-10 · unverdicted · novelty 6.0

At sufficient scale, LLMs linearly represent the truth value of factual statements, as shown by visualizations, cross-dataset generalization, and causal interventions that flip truth judgments.

Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

cs.CL · 2026-05-12 · unverdicted · novelty 4.0

Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.

There Will Be a Scientific Theory of Deep Learning

stat.ML · 2026-04-23 · unverdicted · novelty 2.0

A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.

citing papers explorer

Showing 15 of 15 citing papers.

SwordBench: Evaluating Orthogonality of Steering Image Representations cs.CV · 2026-05-10 · unverdicted · none · ref 38
SwordBench benchmarks steering methods for concept removal in vision models and shows that linear SVMs achieve strong separability and orthogonality but incur collateral damage, while sparse autoencoders often perform better and no method reaches perfect steering even in simple cases.
How Much Do Circuits Tell Us? Measuring the Consistency and Specificity of Language Model Circuits cs.CL · 2026-05-08 · unverdicted · none · ref 2
Language model circuits show high within-task consistency and necessity but substantial overlap across tasks, making them less specific than assumed.
Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models cs.LG · 2026-05-07 · unverdicted · none · ref 80
Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts cs.AI · 2026-05-01 · unverdicted · none · ref 90
Llama-3.1-8B computes sums for cyclic concepts using base-10 addition via task-agnostic Fourier features with periods 2, 5, and 10 rather than modular arithmetic in the concept period.
Refusal in Language Models Is Mediated by a Single Direction cs.LG · 2024-06-17 · accept · none · ref 17
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
Towards Distillation Guarantees under Algorithmic Alignment for Combinatorial Optimization cs.LG · 2026-05-19 · unverdicted · none · ref 4
Provides sufficient conditions for successful distillation of combinatorial optimization tasks into DP-aligned graph neural networks under the linear representation hypothesis for the source model.
Polar probe linearly decodes semantic structures from LLMs cs.CL · 2026-05-13 · unverdicted · none · ref 81 · 2 links
LLMs represent semantic relations geometrically via embedding distance and direction; a linear Polar Probe decodes these structures from middle-layer activations and generalizes to new entities.
The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans cs.CL · 2026-05-09 · unverdicted · none · ref 58
LLMs show a grounding gap with humans on abstract concepts, with property-generation correlations at most r=0.37 versus human-to-human r>0.9, though larger models align better on explicit rating tasks and internal SAE features capture some grounding dimensions.
Minimizing Collateral Damage in Activation Steering cs.LG · 2026-05-01 · unverdicted · none · ref 39
Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.
Understanding the Mechanism of Altruism in Large Language Models econ.GN · 2026-04-21 · unverdicted · none · ref 299
A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
LLM Safety From Within: Detecting Harmful Content with Internal Representations cs.AI · 2026-04-20 · unverdicted · none · ref 24
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
Linear Representations of Sentiment in Large Language Models cs.LG · 2023-10-23 · unverdicted · none · ref 47
Sentiment is represented as a single linear direction in LLM activation space that is causally relevant across tasks and is summarized at punctuation and names in addition to charged words.
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets cs.AI · 2023-10-10 · unverdicted · none · ref 27
At sufficient scale, LLMs linearly represent the truth value of factual statements, as shown by visualizations, cross-dataset generalization, and causal interventions that flip truth judgments.
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models cs.CL · 2026-05-12 · unverdicted · none · ref 72
Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.
There Will Be a Scientific Theory of Deep Learning stat.ML · 2026-04-23 · unverdicted · none · ref 90
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer