Pith · machine review for the scientific record

arxiv: 2310.06824 · v3 · submitted 2023-10-10 · 💻 cs.AI

Recognition: no theorem link

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Samuel Marks, Max Tegmark

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 19:29 UTC · model grok-4.3

classification 💻 cs.AI
keywords: large language models · truth representation · linear probes · model activations · causal interventions · factual statements · hallucination detection

The pith

Large language models encode the truth or falsehood of factual statements as a linear direction in their activation space at sufficient scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the internal activations of large language models as they process simple true and false statements from high-quality datasets. Visualizations of these activations show a clear linear separation between true and false examples. Probes trained on one dataset transfer successfully to others, and editing activations along the identified direction makes the model treat false statements as true and vice versa. This matters for AI reliability because the structure points to a concrete mechanism for detecting, and potentially correcting, inaccurate model outputs.

Core claim

At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements. Visualizations reveal clear linear structure in the representations. Probes trained on one dataset generalize to different datasets. Causal interventions along the linear direction in a model's forward pass cause it to treat false statements as true and true statements as false. Simple difference-in-mean probes identify directions that are more causally implicated in model outputs than other probing techniques.

What carries the argument

The truth direction: a vector in activation space separating true from false statements, which linear probes use for classification and which supports direct causal interventions that flip the model's truth judgments.
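The difference-in-mean probe the paper favors can be sketched in a few lines. The data below is synthetic (a planted direction plus noise stands in for real LLM activations), so every name and number here is illustrative, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for LLM activations: a hidden "truth direction"
# separates true from false statements, plus isotropic noise.
d = 64
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)
acts_true = rng.normal(size=(200, d)) + 2.0 * truth_dir
acts_false = rng.normal(size=(200, d)) - 2.0 * truth_dir

# Difference-in-mean probe: the direction is simply the gap between class means.
direction = acts_true.mean(axis=0) - acts_false.mean(axis=0)
direction /= np.linalg.norm(direction)

# Classify by projecting onto the direction and thresholding at the midpoint.
midpoint = 0.5 * ((acts_true @ direction).mean() + (acts_false @ direction).mean())

def predict_true(acts):
    return (acts @ direction) > midpoint

accuracy = 0.5 * (predict_true(acts_true).mean() + (~predict_true(acts_false)).mean())
print(f"probe accuracy: {accuracy:.2f}")
```

The appeal of this probe is that it has no fit hyperparameters: the direction is determined entirely by the two class means, which is part of why the paper can ask whether it is more causally implicated in outputs than learned probes.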

If this is right

  • Probes trained on one true/false dataset generalize to others without additional training.
  • Intervening along the linear direction during the forward pass directly alters whether the model outputs true or false responses.
  • Basic difference-in-mean probes match the performance of more complex probing methods while being more causally relevant to the outputs.
  • The linear representation of truth becomes apparent only once models reach larger scales.
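The intervention in the second bullet amounts to adding a scaled copy of the direction to a hidden state mid-forward-pass. A toy version on a single synthetic activation (the direction, scale, and activation are all illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64

# A candidate truth direction (in the paper this comes from a probe).
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)

# Synthetic activation for a false statement, constructed so that its
# projection onto the truth direction is exactly -2.
h = rng.normal(size=d)
h_false = h - (h @ truth_dir) * truth_dir - 2.0 * truth_dir

def intervene(h, direction, alpha):
    # Shift the hidden state along the candidate direction.
    return h + alpha * direction

h_edited = intervene(h_false, truth_dir, alpha=4.0)

# The sign of the projection flips (-2.0 -> +2.0), so downstream layers that
# read this direction now treat the false statement as true.
print(h_false @ truth_dir, h_edited @ truth_dir)
```

In a real model the shift would be applied to residual-stream activations at a chosen layer during generation; the toy version only shows why a single vector addition can flip the readout.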

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the linear direction is stable, it could be amplified during generation to reduce false outputs in deployed models.
  • Similar linear encodings might exist for other factual or logical distinctions beyond simple true/false.
  • This structure raises the possibility of using activation edits as a post-training tool for truthfulness without full retraining.

Load-bearing premise

The true/false labels in the chosen datasets reflect genuine truth distinctions rather than being confounded by unrelated surface features such as statement length, topic, or sentiment.

What would settle it

A new collection of true/false statements on previously unseen topics on which a probe trained on the original datasets performs no better than chance, or on which activation interventions along the candidate direction leave the model's output probabilities unchanged.
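The transfer half of that test can be mocked up directly: train a difference-in-mean probe on one synthetic "topic" and evaluate it on another that shares only the planted truth direction. All datasets and scales here are invented for illustration; chance-level accuracy on the held-out topic is the outcome that would undercut the paper's claim:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64
shared_dir = rng.normal(size=d)
shared_dir /= np.linalg.norm(shared_dir)

def make_dataset(n, topic_scale):
    # Each synthetic topic has its own mean offset; only the truth
    # direction is shared across datasets.
    offset = rng.normal(size=d) * topic_scale
    acts_t = rng.normal(size=(n, d)) + 2.0 * shared_dir + offset
    acts_f = rng.normal(size=(n, d)) - 2.0 * shared_dir + offset
    return acts_t, acts_f

train_t, train_f = make_dataset(200, topic_scale=1.0)
test_t, test_f = make_dataset(200, topic_scale=3.0)   # "unseen topic"

direction = train_t.mean(0) - train_f.mean(0)
direction /= np.linalg.norm(direction)

def transfer_accuracy(acts_t, acts_f):
    # Center with the pooled, label-free mean so the topic offset cancels.
    pooled = np.vstack([acts_t, acts_f]).mean(0)
    return 0.5 * (((acts_t - pooled) @ direction > 0).mean()
                  + ((acts_f - pooled) @ direction < 0).mean())

acc = transfer_accuracy(test_t, test_f)
print(f"transfer accuracy: {acc:.2f}")  # near 0.5 would undercut the claim
```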

Original abstract

Large Language Models (LLMs) have impressive capabilities, but are prone to outputting falsehoods. Recent work has developed techniques for inferring whether a LLM is telling the truth by training probes on the LLM's internal activations. However, this line of work is controversial, with some authors pointing out failures of these probes to generalize in basic ways, among other conceptual issues. In this work, we use high-quality datasets of simple true/false statements to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure. 2. Transfer experiments in which probes trained on one dataset generalize to different datasets. 3. Causal evidence obtained by surgically intervening in a LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements. We also show that simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that large language models at sufficient scale linearly represent the truth or falsehood of factual statements. It supports this with three lines of evidence from high-quality true/false statement datasets: (1) visualizations of internal activations showing clear linear structure separating true and false items, (2) transfer experiments where probes trained on one dataset generalize to others, and (3) causal interventions that edit the identified direction in the forward pass to flip the model's treatment of statements as true or false. It further argues that simple difference-in-mean probes match more complex methods in performance while yielding more causally relevant directions.

Significance. If the central claim holds after addressing potential confounds, the work would strengthen evidence for emergent linear representations of abstract concepts like truth in LLMs, with direct implications for mechanistic interpretability and hallucination mitigation. The demonstration that difference-in-mean probes are competitive yet more causally implicated is a practical contribution, and the combination of visualization, transfer, and intervention provides converging evidence that goes beyond correlational probing.

major comments (3)
  1. [Datasets] Datasets section: The manuscript does not report any explicit balancing, matching, or regression controls for surface features (statement length, lexical frequency, syntactic complexity, or sentiment polarity) between true and false examples. Since visualizations, cross-dataset transfer, and causal interventions all rely on the same family of datasets, systematic differences in these features could produce the observed linear direction without it encoding truth per se.
  2. [Causal intervention experiments] Causal intervention experiments (likely §5): While editing the direction changes model outputs to treat false statements as true (and vice versa), this shows only that the direction is used in generation; it does not isolate whether its semantic content is truth rather than a correlated surface statistic. An explicit test (e.g., intervening after regressing out length/sentiment) would be needed to support the stronger claim.
  3. [Transfer results] Transfer results (likely §4.2): Generalization across datasets is reported, but without accompanying statistics confirming that the datasets differ substantially in surface features while sharing only the truth label, the transfer could still be driven by shared confounds rather than a shared truth direction.
minor comments (2)
  1. [Figures] Figure captions and axis labels in the visualization panels could be expanded to include the exact models and layers used, improving reproducibility.
  2. [Datasets] The paper would benefit from a short table summarizing dataset statistics (e.g., mean length, sentiment scores) for true vs. false splits to allow readers to assess balance.
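The control the referee asks for in the major comments amounts to partialling surface features out of the activations before any direction is identified. A minimal sketch with ordinary least squares, on synthetic, deliberately confounded data (the feature choices are hypothetical examples, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 400, 64

# Hypothetical surface features per statement (e.g., length, sentiment score);
# the activations are synthetic and deliberately confounded by them.
features = rng.normal(size=(n, 2))
acts = rng.normal(size=(n, d)) + features @ rng.normal(size=(2, d))

# OLS: remove the component of each activation dimension that is linearly
# predictable from the surface features, keeping only the residual.
X = np.column_stack([np.ones(n), features])        # intercept + features
beta, *_ = np.linalg.lstsq(X, acts, rcond=None)
residual_acts = acts - X @ beta

# By construction the residuals are orthogonal to the features, so any linear
# direction found in them cannot be driven by these surface statistics.
print(np.abs(features.T @ residual_acts).max())
```

A direction that survives this regression (as the rebuttal promises to show) is much harder to explain away as length or sentiment; one that vanishes would support the confound interpretation.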

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify potential confounds in our evidence for linear truth representations. We address each major point below with specific revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Datasets] Datasets section: The manuscript does not report any explicit balancing, matching, or regression controls for surface features (statement length, lexical frequency, syntactic complexity, or sentiment polarity) between true and false examples. Since visualizations, cross-dataset transfer, and causal interventions all rely on the same family of datasets, systematic differences in these features could produce the observed linear direction without it encoding truth per se.

    Authors: We acknowledge that explicit balancing statistics and regression controls for surface features were not reported in the submitted manuscript. The datasets consist of simple factual statements drawn from prior high-quality sources, with true/false pairs differing primarily in factual content. In the revision, we will add a dedicated analysis subsection (with accompanying table and figures) reporting mean/variance statistics for statement length, lexical frequency, syntactic complexity, and sentiment polarity across true and false classes, along with regression-based controls demonstrating that the linear direction and visualizations persist after removing these features. This directly addresses the concern. revision: yes

  2. Referee: [Causal intervention experiments] Causal intervention experiments (likely §5): While editing the direction changes model outputs to treat false statements as true (and vice versa), this shows only that the direction is used in generation; it does not isolate whether its semantic content is truth rather than a correlated surface statistic. An explicit test (e.g., intervening after regressing out length/sentiment) would be needed to support the stronger claim.

    Authors: The interventions establish that the direction is causally used by the model during generation on these tasks. To more rigorously isolate semantic content from surface statistics, we will add controlled intervention experiments in the revision: we regress out length, sentiment, and related features from activations prior to direction identification and intervention, then report that the truth-flipping effect remains statistically significant. These results will be presented alongside the original interventions. revision: yes

  3. Referee: [Transfer results] Transfer results (likely §4.2): Generalization across datasets is reported, but without accompanying statistics confirming that the datasets differ substantially in surface features while sharing only the truth label, the transfer could still be driven by shared confounds rather than a shared truth direction.

    Authors: We will expand the transfer section in the revision to include explicit comparative statistics (e.g., pairwise distances or ANOVA results) on surface features across the datasets, confirming substantial differences in length, lexical properties, and syntax while they share only the truth label. We will also report transfer performance after partialling out these features, showing that generalization is driven by the shared truth direction. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evidence chain is self-contained

Full rationale

The paper presents three lines of evidence (visualizations of activations, probe transfer across datasets, and causal interventions on identified directions) rather than a mathematical derivation. Probes are fit to labels, but transfer performance on distinct datasets and the causal editing results (flipping model outputs by adding/subtracting the direction) constitute independent tests that do not reduce to the fitting step by construction. No self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked as load-bearing steps. The difference-in-mean probe is compared to other methods and shown to be more causally effective, but this is an empirical comparison, not a tautological renaming. The analysis remains within standard supervised probing plus intervention methodology without the forbidden patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the domain assumption that linear directions in activation space can capture semantic properties such as truth, plus the practical assumption that the curated true/false datasets isolate truth from other correlated features.

axioms (1)
  • domain assumption: Linear probes on LLM activations can recover semantic distinctions such as truth value.
    Invoked by the decision to train and evaluate linear classifiers on internal representations.

pith-pipeline@v0.9.0 · 5493 in / 1097 out tokens · 43912 ms · 2026-05-12T19:29:26.086618+00:00 · methodology

discussion (0)


Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  2. Geometric Factual Recall in Transformers

    cs.CL 2026-05 conditional novelty 8.0

    A single-layer transformer memorizes random subject-attribute bijections using logarithmic embedding dimension via linear superpositions in embeddings and ReLU-gated selection in the MLP, with zero-shot transfer to ne...

  3. Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

    cs.LG 2026-04 accept novelty 8.0

    Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

  4. Deep Minds and Shallow Probes

    cs.LG 2026-05 unverdicted novelty 7.0

    Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

  5. Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs

    cs.CL 2026-05 unverdicted novelty 7.0

    LLMs encode repeated token counts correctly in residual streams but a format-triggered MLP at 88-93% depth overwrites it with an incorrect fixed value.

  6. Steer Like the LLM: Activation Steering that Mimics Prompting

    cs.CL 2026-05 unverdicted novelty 7.0

    PSR models that estimate token-specific steering coefficients from activations outperform standard activation steering and compare favorably to prompting on steering benchmarks.

  7. Cell-Based Representation of Relational Binding in Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...

  8. Emotion Concepts and their Function in a Large Language Model

    cs.AI 2026-04 unverdicted novelty 7.0

    Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

  9. Detecting Multi-Agent Collusion Through Multi-Agent Interpretability

    cs.AI 2026-04 conditional novelty 7.0

    NARCBench and five activation-probing methods detect multi-agent collusion with 0.73-1.00 AUROC across distribution shifts and steganographic tasks by aggregating per-agent signals.

  10. The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

    cs.LG 2026-03 unverdicted novelty 7.0

    The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a stro...

  11. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  12. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

  13. When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

    cs.AI 2026-05 unverdicted novelty 6.0

    Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.

  14. Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...

  15. Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance

    cs.AI 2026-05 unverdicted novelty 6.0

    SVGT adds independent value modules and Bridge Tokens to LLMs to maintain consistent value guidance, cutting harmful outputs by over 70% in tests while preserving fluency.

  16. The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.

  17. Architecture, Not Scale: Circuit Localization in Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Grouped query attention produces more concentrated and stable circuits than multi-head attention across tasks and scales in Pythia and Qwen2.5 models, with a phase transition in factual recall circuits.

  18. Tool Calling is Linearly Readable and Steerable in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

  19. Emergent Symbolic Structure in Health Foundation Models: Extraction, Alignment, and Cross-Modal Transfer

    cs.LG 2026-05 unverdicted novelty 6.0

    Health foundation model embeddings contain an interpretable symbolic organization shared across modalities that supports cross-domain transfer without joint training.

  20. Hallucination Detection via Activations of Open-Weight Proxy Analyzers

    cs.CL 2026-05 unverdicted novelty 6.0

    A framework using activation-based features from small open-weight proxy models detects LLM hallucinations with higher AUC than ReDeEP on RAGTruth, performing consistently across seven analyzer architectures.

  21. Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits

    cs.CL 2026-05 unverdicted novelty 6.0

    Probabilistic circuits detect LLM hallucinations as residual-stream anomalies with up to 99% AUROC and enable dynamic correction that raises truthfulness scores while cutting unnecessary output corruption.

  22. Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance

    cs.LG 2026-05 unverdicted novelty 6.0

    Probe-geometry alignment erases cross-sequence memorization signatures in LLMs below chance using per-depth rank-one activation interventions with negligible impact on zero-shot capabilities.

  23. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  24. Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while pr...

  25. Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

    cs.LG 2026-04 unverdicted novelty 6.0

    Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out har...

  26. Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

    cs.LG 2026-04 unverdicted novelty 6.0

    Harmful intent is geometrically recoverable as a linear direction or angular deviation in LLM residual streams, with high AUROC across 12 models, stable under alignment variants including abliterated ones, and transfe...

  27. LLM Safety From Within: Detecting Harmful Content with Internal Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

  28. Testing the Limits of Truth Directions in LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Truth directions in LLMs are not universal but depend heavily on model layer, task type and difficulty, and prompt instructions.

  29. Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs

    cs.CL 2026-03 unverdicted novelty 6.0

    Hallucination neurons in LLMs are domain-specific, with cross-domain classifiers dropping from AUROC 0.783 within-domain to 0.563 across domains.

  30. HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory

    cs.AI 2026-05 unverdicted novelty 5.0

    HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.

  31. Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

    cs.AI 2026-05 unverdicted novelty 5.0

    Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

  32. Negative Before Positive: Asymmetric Valence Processing in Large Language Models

    cs.CL 2026-05 unverdicted novelty 5.0

    Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.

  33. Exploring Concreteness Through a Figurative Lens

    cs.CL 2026-04 unverdicted novelty 5.0

    LLMs compress concreteness into a consistent 1D direction in mid-to-late layers that separates literal from figurative noun uses and supports efficient classification plus steering.

  34. Learning Uncertainty from Sequential Internal Dispersion in Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    SIVR detects LLM hallucinations by learning from token-wise and layer-wise variance patterns in internal hidden states, outperforming baselines with better generalization and less training data.

  35. H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    H-probes locate low-dimensional subspaces encoding hierarchy in LLM activations for synthetic tree tasks, show causal importance and generalization, and detect weaker signals in mathematical reasoning traces.

  36. Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...

  37. From Actions to Understanding: Conformal Interpretability of Temporal Concepts in LLM Agents

    cs.AI 2026-03 unverdicted novelty 5.0

    A conformal interpretability method labels LLM agent states step-by-step and extracts linearly separable temporal concept directions aligned with task success on ScienceWorld and AlfWorld.

  38. Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers

    cs.CV 2026-05 unverdicted novelty 4.0

    Human visual interestingness is linearly decodable from final-layer embeddings in Qwen3-VL-8B and becomes progressively more structured across vision and language layers without explicit supervision.

  39. Risk Reporting for Developers' Internal AI Model Use

    cs.CY 2026-04 unverdicted novelty 4.0

    A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · cited by 38 Pith papers · 1 internal anchor

  1. [1]

    Scaling Learning Algorithms Towards

    Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

  2. [2]

    and Osindero, Simon and Teh, Yee Whye , journal =

    Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

  3. [3]

    2016 , publisher=

    Deep learning , author=. 2016 , publisher=

  4. [4]

    2023 , eprint=

    The Internal State of an LLM Knows When its Lying , author=. 2023 , eprint=

  5. [5]

    2023 , eprint=

    Finding Neurons in a Haystack: Case Studies with Sparse Probing , author=. 2023 , eprint=

  6. [6]

    The Eleventh International Conference on Learning Representations , year=

    Discovering Latent Knowledge in Language Models Without Supervision , author=. The Eleventh International Conference on Learning Representations , year=

  7. [7]

    2023 , eprint=

    Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , author=. 2023 , eprint=

  8. [8]

    2023 , eprint=

    Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks , author=. 2023 , eprint=

  9. [9]

    2023 , url=

    What Discovering Latent Knowledge Did and Did Not Find , author=. 2023 , url=

  10. [11]

    2023 , eprint=

    LLaMA: Open and Efficient Foundation Language Models , author=. 2023 , eprint=

  11. [12]

    The Journal of Machine Learning Research , volume=

    The implicit bias of gradient descent on separable data , author=. The Journal of Machine Learning Research , volume=. 2018 , publisher=

  12. [13]

    2023 , eprint=

    Linearity of Relation Decoding in Transformer Language Models , author=. 2023 , eprint=

  13. [15]

    2022 , eprint=

    Discovering Language Model Behaviors with Model-Written Evaluations , author=. 2022 , eprint=

  14. [16]

    2021 , url=

    Eliciting latent knowledge: How to tell if your eyes deceive you , author=. 2021 , url=

  15. [17]

    2023 , eprint=

    GPT-4 Technical Report , author=. 2023 , eprint=

  16. [18]

    2021 , eprint=

    Can Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color , author=. 2021 , eprint=

  17. [19]

    International Conference on Learning Representations , year=

    Mapping Language Models to Grounded Conceptual Spaces , author=. International Conference on Learning Representations , year=

  18. [20]

    The Eleventh International Conference on Learning Representations , year=

    Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task , author=. The Eleventh International Conference on Learning Representations , year=

  19. [21]

    2020 , doi =

    Bau, David and Zhu, Jun-Yan and Strobelt, Hendrik and Lapedriza, Agata and Zhou, Bolei and Torralba, Antonio , title =. 2020 , doi =

  20. [24]

    Analyzing Individual Neurons in Pre-trained Language Models

    Durrani, Nadir and Sajjad, Hassan and Dalvi, Fahim and Belinkov, Yonatan. Analyzing Individual Neurons in Pre-trained Language Models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.395

  21. [25]

    AAAI Conference on Artificial Intelligence , year=

    What Is One Grain of Sand in the Desert? Analyzing Individual Neurons in Deep NLP Models , author=. AAAI Conference on Artificial Intelligence , year=

  22. [26]

    Distill , year =

    Olah, Chris and Cammarata, Nick and Schubert, Ludwig and Goh, Gabriel and Petrov, Michael and Carter, Shan , title =. Distill , year =

  23. [27]

    2022 , journal=

    Toy Models of Superposition , author=. 2022 , journal=

  24. [28]

    2023 , eprint=

    Sparse Autoencoders Find Highly Interpretable Features in Language Models , author=. 2023 , eprint=

  25. [29]

    Distill , year =

    Goh, Gabriel and †, Nick Cammarata and †, Chelsea Voss and Carter, Shan and Petrov, Michael and Schubert, Ludwig and Radford, Alec and Olah, Chris , title =. Distill , year =

  26. [30]

    Designing and Interpreting Probes with Control Tasks

    Hewitt, John and Liang, Percy. Designing and Interpreting Probes with Control Tasks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1275

  27. [31]

    Probing Classifiers: Promises, Shortcomings, and Advances

    Belinkov, Yonatan. Probing Classifiers: Promises, Shortcomings, and Advances. Computational Linguistics. 2022. doi:10.1162/coli_a_00422

  28. [32]

    Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence , pages =

    Pearl, Judea , title =. Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence , pages =. 2001 , isbn =

  29. [33]

    Investigating Gender Bias in Language Models Using Causal Mediation Analysis , url =

    Vig, Jesse and Gehrmann, Sebastian and Belinkov, Yonatan and Qian, Sharon and Nevo, Daniel and Singer, Yaron and Shieber, Stuart , booktitle =. Investigating Gender Bias in Language Models Using Causal Mediation Analysis , url =

  30. [34]

    Locating and Editing Factual Associations in

    Kevin Meng and David Bau and Alex Andonian and Yonatan Belinkov , journal=. Locating and Editing Factual Associations in

  31. [35]

    2023 , eprint=

    Explore, Establish, Exploit: Red Teaming Language Models from Scratch , author=. 2023 , eprint=

  32. [36]

    All Cities with a population

    Geonames , year =. All Cities with a population

  33. [37]

    2023 , url=

    Emergent Deception and Emergent Optimization , author=. 2023 , url=

  34. [38]

    2023 , eprint=

    AI Deception: A Survey of Examples, Risks, and Potential Solutions , author=. 2023 , eprint=

  35. [39]

    Understanding intermediate layers using linear classifier probes

    Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. 2018.

  36. [40]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023.

  37. [41]

    GPT-NeoX-20B: An Open-Source Autoregressive Language Model

    Black, Sidney and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, Usvsn Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel. GPT-NeoX-20B: An Open-Source Autoregressive Language Model.

  38. [42]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang et al. OPT: Open Pre-trained Transformer Language Models. 2022.

  39. [43]

    Localizing Model Behavior with Path Patching

    Nicholas Goldowsky-Dill et al. Localizing Model Behavior with Path Patching. 2023.

  40. [44]

    Linear Representations of Sentiment in Large Language Models

    Curt Tigges et al. Linear Representations of Sentiment in Large Language Models. 2023.

  41. [45]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete Problems in AI Safety. 2016.

  42. [46]

    Measuring Progress on Scalable Oversight for Large Language Models

    Samuel R. Bowman et al. Measuring Progress on Scalable Oversight for Large Language Models. 2022.

  43. [47]

    Deep reinforcement learning from human preferences

    Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. 2023.

  44. [48]

    Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

    Stephen Casper et al. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. 2023.

  45. [49]

    Collaborative data science

    Plotly Technologies Inc. Collaborative data science. 2015.

  46. [52]

    Dissecting Recall of Factual Associations in Auto-Regressive Language Models

    Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting Recall of Factual Associations in Auto-Regressive Language Models. 2023.

  47. [53]

    Causal Abstractions of Neural Networks

    Atticus Geiger and Hanson Lu and Thomas Icard and Christopher Potts. Causal Abstractions of Neural Networks. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021). 2021.

  48. [54]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Andy Zou et al. Representation Engineering: A Top-Down Approach to AI Transparency. 2023.

  49. [56]

    Steering Llama 2 via Contrastive Activation Addition

    Nina Rimsky et al. Steering Llama 2 via Contrastive Activation Addition. 2024.

  50. [57]

    CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge

    Yasumasa Onoe, Michael JQ Zhang, Eunsol Choi, and Greg Durrett. CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=mbW_GT3ZN-

  51. [58]

    LEACE: Perfect Linear Concept Erasure in Closed Form

    Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Biderman. LEACE: Perfect Linear Concept Erasure in Closed Form. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=awIpKpwTwF

  52. [59]

    Can language models encode perceptual structure without grounding? a case study in color, 2021

    Mostafa Abdou, Artur Kulmizev, Daniel Hershcovich, Stella Frank, Ellie Pavlick, and Anders Søgaard. Can language models encode perceptual structure without grounding? a case study in color, 2021

  53. [60]

    Understanding intermediate layers using linear classifier probes, 2018

    Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes, 2018

  54. [61]

    The internal state of an LLM knows when it's lying, 2023

    Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it's lying, 2023

  55. [62]

    Understanding the role of individual units in a deep neural network

    David Bau, Jun-Yan Zhu, Hendrik Strobelt, Agata Lapedriza, Bolei Zhou, and Antonio Torralba. Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences, 2020. ISSN 0027-8424. doi:10.1073/pnas.1907375117. URL https://www.pnas.org/content/early/2020/08/31/1907375117

  56. [63]

    LEACE: Perfect linear concept erasure in closed form

    Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Biderman. LEACE: Perfect linear concept erasure in closed form. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=awIpKpwTwF

  57. [64]

    Discovering latent knowledge in language models without supervision

    Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=ETKGuby0hcs

  58. [65]

    Explore, establish, exploit: Red teaming language models from scratch, 2023

    Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, and Dylan Hadfield-Menell. Explore, establish, exploit: Red teaming language models from scratch, 2023

  59. [66]

    Eliciting latent knowledge: How to tell if your eyes deceive you, 2021

    Paul Christiano, Ajeya Cotra, and Mark Xu. Eliciting latent knowledge: How to tell if your eyes deceive you, 2021. URL https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.jrzi4atzacns

  60. [67]

    Sparse autoencoders find highly interpretable features in language models, 2023

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models, 2023

  61. [68]

    What is one grain of sand in the desert? Analyzing individual neurons in deep NLP models

    Fahim Dalvi, Nadir Durrani, Hassan Sajjad, Yonatan Belinkov, Anthony Bau, and James R. Glass. What is one grain of sand in the desert? Analyzing individual neurons in deep NLP models. In AAAI Conference on Artificial Intelligence, 2018. URL https://api.semanticscholar.org/CorpusID:56895415

  62. [69]

    Toy models of superposition

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022

  63. [70]

    Causal analysis of syntactic agreement mechanisms in neural language models

    Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart Shieber, Tal Linzen, and Yonatan Belinkov. Causal analysis of syntactic agreement mechanisms in neural language models. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021.

  64. [71]

    The use of multiple measurements in taxonomic problems

    R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2): 179--188, 1936. doi:10.1111/j.1469-1809.1936.tb02137.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1469-1809.1936.tb02137.x

  65. [72]

    Neural natural language inference models partially embed theories of lexical entailment and negation

    Atticus Geiger, Kyle Richardson, and Christopher Potts. Neural natural language inference models partially embed theories of lexical entailment and negation. In Afra Alishahi, Yonatan Belinkov, Grzegorz Chrupala, Dieuwke Hupkes, Yuval Pinter, and Hassan Sajjad (eds.), Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020.

  66. [73]

    Causal abstractions of neural networks

    Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021 (NeurIPS 2021), 2021.

  67. [74]

    All cities with a population > 1000, 2023

    Geonames. All cities with a population > 1000, 2023. URL https://download.geonames.org/export/dump/

  68. [75]

    Dissecting recall of factual associations in auto-regressive language models, 2023

    Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models, 2023

  69. [76]

    Multimodal neurons in artificial neural networks

    Gabriel Goh, Nick Cammarata †, Chelsea Voss †, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal neurons in artificial neural networks. Distill, 2021. doi:10.23915/distill.00030. https://distill.pub/2021/multimodal-neurons

  70. [77]

    Finding neurons in a haystack: Case studies with sparse probing, 2023

    Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing, 2023

  71. [78]

    Still no lie detector for language models: Probing empirical and conceptual roadblocks

    B. A. Levinstein and Daniel A. Herrmann. Still no lie detector for language models: Probing empirical and conceptual roadblocks, 2023

  72. [79]

    Implicit representations of meaning in neural language models

    Belinda Z. Li, Maxwell Nye, and Jacob Andreas. Implicit representations of meaning in neural language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1813--1827, Online, August 2021. Association for Computational Linguistics.

  73. [80]

    Emergent world representations: Exploring a sequence model trained on a synthetic task

    Kenneth Li, Aspen K Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. In The Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=DeG07_TcZvT

  74. [81]

    Inference-time intervention: Eliciting truthful answers from a language model, 2023b

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model, 2023b

  75. [82]

    TruthfulQA: Measuring how models mimic human falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3214--3252, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.acl-long.229

  76. [83]

    Locating and editing factual associations in GPT

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 36, 2022

  77. [84]

    CREAK: A dataset for commonsense reasoning over entity knowledge

    Yasumasa Onoe, Michael JQ Zhang, Eunsol Choi, and Greg Durrett. CREAK: A dataset for commonsense reasoning over entity knowledge. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=mbW_GT3ZN-

  78. [85]

    GPT-4 technical report, 2023

    OpenAI. GPT-4 technical report, 2023

  79. [86]

    AI deception: A survey of examples, risks, and potential solutions

    Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, and Dan Hendrycks. AI deception: A survey of examples, risks, and potential solutions, 2023

  80. [87]

    Mapping language models to grounded conceptual spaces

    Roma Patel and Ellie Pavlick. Mapping language models to grounded conceptual spaces. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=gJcEM8sxHK

Showing first 80 references.