hub

Linearity of relation decoding in transformer language models

Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, David Bau · 2023 · arXiv 2308.09124

19 Pith papers cite this work. Polarity classification is still indexing.

19 Pith papers citing it

read on arXiv browse 19 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2

citation-polarity summary

background 1 unclear 1

representative citing papers

The Linear Representation Hypothesis and the Geometry of Large Language Models

cs.CL · 2023-11-07 · conditional · novelty 8.0

Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.

PRISM: Recovering Instruction Sets from Language Model Activations

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.

Measuring Alignment-Induced Activation Shifts Correctly: A Template-Controlled Difference-in-Differences Protocol

cs.LG · 2026-05-23 · unverdicted · novelty 7.0

Introduces a template-controlled difference-in-differences protocol that corrects chat-template confounding when measuring alignment-induced activation shifts in LLMs and recovers the refusal direction with higher fidelity.

Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

Language models produce overcomplete reasoning traces where on average 46% of steps can be removed while preserving the answer in 86% of cases, with necessity concentrated in the top three steps.

GKnow: Measuring the Entanglement of Gender Bias and Factual Gender

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

Gender bias and factual gender knowledge are severely entangled in language model circuits and neurons, making neuron ablation an unreliable method for debiasing.

Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction

cs.AI · 2026-05-04 · unverdicted · novelty 7.0

A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in a toy logic task.

Cell-Based Representation of Relational Binding in Language Models

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.

Factual Retrieval in LLMs Is a Redundant, Distributed and Non-Contiguous Process

cs.CL · 2026-06-19 · unverdicted · novelty 6.0

Attribute retrieval in LLMs follows non-contiguous, redundant layer paths identified via iterative patching, implying highly distributed knowledge storage.

From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models

cs.CL · 2026-06-18 · unverdicted · novelty 6.0

LLM representations encode essay quality in a linearly decodable form that emerges across layers and includes identifiable scoring neurons whose distribution shifts with essay length.

Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers

cs.AI · 2026-06-06 · unverdicted · novelty 6.0

Standard tests for mechanistic roles in transformer attention heads are insufficient because heads that pass them fail to transfer computations across prompts under matched controls.

Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models

cs.CL · 2026-05-10 · unverdicted · novelty 6.0

Steering Dark Triad features in an LLM increases exploitative and aggressive behavior while leaving strategic deception and cognitive empathy unchanged, indicating dissociable antisocial pathways.

Architecture, Not Scale: Circuit Localization in Large Language Models

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

Grouped query attention produces more concentrated and stable circuits than multi-head attention across tasks and scales in Pythia and Qwen2.5 models, with a phase transition in factual recall circuits.

LLM Safety From Within: Detecting Harmful Content with Internal Representations

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

How Do Language Models Compose Functions?

cs.CL · 2025-10-02 · conditional · novelty 6.0

LLMs solve compositional factual recall either by computing intermediates or directly, with mechanism choice correlated to translation geometry in embedding spaces.

How Training Data Shapes the Use of Parametric and In-Context Knowledge in Language Models

cs.CL · 2025-09-29 · unverdicted · novelty 6.0

Balanced parametric and in-context knowledge use in LLMs is an emergent property requiring intra-document repetition, moderate inconsistency, and skewed distributions in training data.

Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization

cs.CL · 2026-06-02 · unverdicted · novelty 5.0

DCO is an inference-time intervention that decomposes attention head outputs orthogonally to a dynamic context anchor and suppresses outlier components via Z-score to improve contextual faithfulness in Llama models.

Towards Effective Theory of LLMs: A Representation Learning Approach

cs.LG · 2026-05-10 · unverdicted · novelty 5.0

RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.

Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

cs.AI · 2026-05-07 · unverdicted · novelty 5.0

Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

GeoMathCode: Understanding Interleaved Math-Code Reasoning for Geometry Problem Solving

cs.CL · 2026-05-25 · unverdicted · novelty 4.0

GeoMathCode interleaves math reasoning with programmatic code outputs for geometry problems in MLLMs and shows that reasoning steps and hierarchical code structures become disentangled in latent space after SFT.

citing papers explorer

Showing 19 of 19 citing papers.

The Linear Representation Hypothesis and the Geometry of Large Language Models cs.CL · 2023-11-07 · conditional · none · ref 9
Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
PRISM: Recovering Instruction Sets from Language Model Activations cs.AI · 2026-06-08 · unverdicted · none · ref 21
PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.
Measuring Alignment-Induced Activation Shifts Correctly: A Template-Controlled Difference-in-Differences Protocol cs.LG · 2026-05-23 · unverdicted · none · ref 6
Introduces a template-controlled difference-in-differences protocol that corrects chat-template confounding when measuring alignment-induced activation shifts in LLMs and recovers the refusal direction with higher fidelity.
Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces cs.AI · 2026-05-14 · unverdicted · none · ref 64
Language models produce overcomplete reasoning traces where on average 46% of steps can be removed while preserving the answer in 86% of cases, with necessity concentrated in the top three steps.
GKnow: Measuring the Entanglement of Gender Bias and Factual Gender cs.CL · 2026-05-12 · unverdicted · none · ref 39
Gender bias and factual gender knowledge are severely entangled in language model circuits and neurons, making neuron ablation an unreliable method for debiasing.
Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction cs.AI · 2026-05-04 · unverdicted · none · ref 12
A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in a toy logic task.
Cell-Based Representation of Relational Binding in Language Models cs.CL · 2026-04-21 · unverdicted · none · ref 22
Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.
Factual Retrieval in LLMs Is a Redundant, Distributed and Non-Contiguous Process cs.CL · 2026-06-19 · unverdicted · none · ref 16
Attribute retrieval in LLMs follows non-contiguous, redundant layer paths identified via iterative patching, implying highly distributed knowledge storage.
From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models cs.CL · 2026-06-18 · unverdicted · none · ref 61
LLM representations encode essay quality in a linearly decodable form that emerges across layers and includes identifiable scoring neurons whose distribution shifts with essay length.
Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers cs.AI · 2026-06-06 · unverdicted · none · ref 7
Standard tests for mechanistic roles in transformer attention heads are insufficient because heads that pass them fail to transfer computations across prompts under matched controls.
Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models cs.CL · 2026-05-10 · unverdicted · none · ref 11
Steering Dark Triad features in an LLM increases exploitative and aggressive behavior while leaving strategic deception and cognitive empathy unchanged, indicating dissociable antisocial pathways.
Architecture, Not Scale: Circuit Localization in Large Language Models cs.CL · 2026-05-09 · unverdicted · none · ref 7
Grouped query attention produces more concentrated and stable circuits than multi-head attention across tasks and scales in Pythia and Qwen2.5 models, with a phase transition in factual recall circuits.
LLM Safety From Within: Detecting Harmful Content with Internal Representations cs.AI · 2026-04-20 · unverdicted · none · ref 36
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
How Do Language Models Compose Functions? cs.CL · 2025-10-02 · conditional · none · ref 14
LLMs solve compositional factual recall either by computing intermediates or directly, with mechanism choice correlated to translation geometry in embedding spaces.
How Training Data Shapes the Use of Parametric and In-Context Knowledge in Language Models cs.CL · 2025-09-29 · unverdicted · none · ref 5
Balanced parametric and in-context knowledge use in LLMs is an emergent property requiring intra-document repetition, moderate inconsistency, and skewed distributions in training data.
Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization cs.CL · 2026-06-02 · unverdicted · none · ref 54
DCO is an inference-time intervention that decomposes attention head outputs orthogonally to a dynamic context anchor and suppresses outlier components via Z-score to improve contextual faithfulness in Llama models.
Towards Effective Theory of LLMs: A Representation Learning Approach cs.LG · 2026-05-10 · unverdicted · none · ref 32
RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes cs.AI · 2026-05-07 · unverdicted · none · ref 28
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
GeoMathCode: Understanding Interleaved Math-Code Reasoning for Geometry Problem Solving cs.CL · 2026-05-25 · unverdicted · none · ref 11
GeoMathCode interleaves math reasoning with programmatic code outputs for geometry problems in MLLMs and shows that reasoning steps and hierarchical code structures become disentangled in latent space after SFT.

Linearity of relation decoding in transformer language models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer