Emergent linear representations in world models of self-supervised sequence models

2023 · arXiv 2309.00941

7 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Linear probes for Othello board states factor into tensor-product structure with square and color embeddings composed by a binding matrix, from which the linear probes can be directly recovered.

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.

Tool Calling is Linearly Readable and Steerable in Language Models

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

Predicting Where Steering Vectors Succeed

cs.LG · 2026-04-16 · unverdicted · novelty 6.0

The Linear Accessibility Profile predicts steering vector effectiveness and optimal layers with Spearman correlations of 0.86-0.91 using unembedding projections on intermediate states across multiple models and concepts.

Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

cs.AI · 2026-05-07 · unverdicted · novelty 5.0

Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

citing papers explorer

Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions cs.LG · 2026-05-11 · unverdicted · none · ref 18
Refusal in Language Models Is Mediated by a Single Direction cs.LG · 2024-06-17 · accept · none · ref 164
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces cs.LG · 2026-05-12 · unverdicted · none · ref 246
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space cs.CL · 2026-05-12 · unverdicted · none · ref 21
Tool Calling is Linearly Readable and Steerable in Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 62
Predicting Where Steering Vectors Succeed cs.LG · 2026-04-16 · unverdicted · none · ref 8
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes cs.AI · 2026-05-07 · unverdicted · none · ref 50
