The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
Pith reviewed 2026-05-12 19:29 UTC · model grok-4.3
The pith
At sufficient scale, large language models encode the truth or falsehood of factual statements as a linear direction in their activation space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements. Visualizations reveal clear linear structure in the representations. Probes trained on one dataset generalize to different datasets. Causal interventions along the linear direction in a model's forward pass cause it to treat false statements as true and true statements as false. Simple difference-in-mean probes identify directions that are more causally implicated in model outputs than those found by other probing techniques.
What carries the argument
The truth direction: a vector in activation space separating true from false statements, which linear probes use for classification and which supports direct causal interventions that flip the model's truth judgments.
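To make the probing step concrete, here is a minimal sketch of fitting a difference-in-mean truth direction to pre-extracted activations. The toy arrays, the thresholding rule, and all names are illustrative assumptions, not the paper's released code.

```python
# Difference-in-mean "truth direction" probe: toy data and illustrative names only.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64
# Stand-ins for layer activations of true and false statements, shape (n, d_model).
acts_true = rng.normal(size=(200, d_model)) + 0.5
acts_false = rng.normal(size=(200, d_model)) - 0.5

def fit_truth_direction(a_true, a_false):
    """Unit vector pointing from the false-class mean toward the true-class mean."""
    d = a_true.mean(axis=0) - a_false.mean(axis=0)
    return d / np.linalg.norm(d)

def classify(acts, direction, threshold):
    """Call a statement 'true' when its projection onto the direction exceeds the threshold."""
    return acts @ direction > threshold

direction = fit_truth_direction(acts_true, acts_false)
threshold = 0.5 * ((acts_true @ direction).mean() + (acts_false @ direction).mean())

# A transfer test would replace these arrays with activations from a different dataset.
acc_true = classify(acts_true, direction, threshold).mean()
acc_false = 1.0 - classify(acts_false, direction, threshold).mean()
print(f"toy separation: true={acc_true:.2f}, false={acc_false:.2f}")
```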
If this is right
- Probes trained on one true/false dataset generalize to others without additional training.
- Intervening along the linear direction during the forward pass directly alters whether the model outputs true or false responses (a minimal intervention sketch follows this list).
- Basic difference-in-mean probes match the performance of more complex probing methods while being more causally relevant to model outputs.
- The linear representation of truth becomes apparent only once models reach larger scales.
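The intervention bullet above can be pictured as adding or subtracting a scaled copy of the direction inside the forward pass. The sketch below uses a generic PyTorch forward hook on a stand-in layer; the layer choice, coefficient, and hook mechanics are assumptions for illustration rather than the paper's exact procedure.

```python
# Intervening along a candidate truth direction during the forward pass via a hook.
import torch
import torch.nn as nn

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Return a forward hook that shifts the layer output by alpha * unit(direction)."""
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * unit  # alpha > 0 pushes toward "true", alpha < 0 toward "false"
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Toy demonstration: a single linear layer stands in for the transformer block that would
# normally be hooked (e.g. a mid-layer residual stream in an LLM).
layer = nn.Linear(16, 16)
direction = torch.randn(16)
handle = layer.register_forward_hook(make_steering_hook(direction, alpha=-4.0))
shifted = layer(torch.randn(2, 16))  # outputs are displaced along -direction
handle.remove()
```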
Where Pith is reading between the lines
- If the linear direction is stable, it could be amplified during generation to reduce false outputs in deployed models.
- Similar linear encodings might exist for other factual or logical distinctions beyond simple true/false.
- This structure raises the possibility of using activation edits as a post-training tool for truthfulness without full retraining.
Load-bearing premise
The true/false labels in the chosen datasets reflect genuine truth distinctions rather than being confounded by unrelated surface features such as statement length, topic, or sentiment.
What would settle it
A new collection of true/false statements on previously unseen topics where a probe trained on the original datasets performs no better than chance, or where activation interventions along the candidate direction leave the model's output probabilities unchanged.
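One way to operationalize the "no better than chance" part of this test is a simple binomial comparison of held-out-topic accuracy against 0.5. The prediction and label arrays below are illustrative placeholders for the outputs of a transfer evaluation, not results from the paper.

```python
# "No better than chance" check: binomial test of held-out-topic probe accuracy against 0.5.
import numpy as np
from scipy.stats import binomtest

# Placeholders for probe predictions and gold labels on statements about unseen topics.
preds = np.array([True, True, False, True, False, False, True, True])
labels = np.array([True, False, False, True, False, True, True, True])

n_correct = int((preds == labels).sum())
result = binomtest(n_correct, n=len(labels), p=0.5, alternative="greater")
print(f"accuracy={n_correct / len(labels):.2f}, p(accuracy > chance)={result.pvalue:.3f}")
```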
read the original abstract
Large Language Models (LLMs) have impressive capabilities, but are prone to outputting falsehoods. Recent work has developed techniques for inferring whether a LLM is telling the truth by training probes on the LLM's internal activations. However, this line of work is controversial, with some authors pointing out failures of these probes to generalize in basic ways, among other conceptual issues. In this work, we use high-quality datasets of simple true/false statements to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure. 2. Transfer experiments in which probes trained on one dataset generalize to different datasets. 3. Causal evidence obtained by surgically intervening in a LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements. We also show that simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that large language models at sufficient scale linearly represent the truth or falsehood of factual statements. It supports this with three lines of evidence from high-quality true/false statement datasets: (1) visualizations of internal activations showing clear linear structure separating true and false items, (2) transfer experiments where probes trained on one dataset generalize to others, and (3) causal interventions that edit the identified direction in the forward pass to flip the model's treatment of statements as true or false. It further argues that simple difference-in-mean probes match more complex methods in performance while yielding more causally relevant directions.
Significance. If the central claim holds after addressing potential confounds, the work would strengthen evidence for emergent linear representations of abstract concepts like truth in LLMs, with direct implications for mechanistic interpretability and hallucination mitigation. The demonstration that difference-in-mean probes are competitive yet more causally implicated is a practical contribution, and the combination of visualization, transfer, and intervention provides converging evidence that goes beyond correlational probing.
major comments (3)
- [Datasets] Datasets section: The manuscript does not report any explicit balancing, matching, or regression controls for surface features (statement length, lexical frequency, syntactic complexity, or sentiment polarity) between true and false examples. Since visualizations, cross-dataset transfer, and causal interventions all rely on the same family of datasets, systematic differences in these features could produce the observed linear direction without it encoding truth per se.
- [Causal intervention experiments] Causal intervention experiments (likely §5): While editing the direction changes model outputs to treat false statements as true (and vice versa), this shows only that the direction is used in generation; it does not isolate whether its semantic content is truth rather than a correlated surface statistic. An explicit test (e.g., intervening after regressing out length/sentiment) would be needed to support the stronger claim.
- [Transfer results] Transfer results (likely §4.2): Generalization across datasets is reported, but without accompanying statistics confirming that the datasets differ substantially in surface features while sharing only the truth label, the transfer could still be driven by shared confounds rather than a shared truth direction.
minor comments (2)
- [Figures] Figure captions and axis labels in the visualization panels could be expanded to include the exact models and layers used, improving reproducibility.
- [Datasets] The paper would benefit from a short table summarizing dataset statistics (e.g., mean length, sentiment scores) for true vs. false splits to allow readers to assess balance; a minimal sketch of such a table follows this list.
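A minimal sketch of the balance table suggested in the last minor comment, using pandas on toy statements; the toy data and the choice of surface features are illustrative assumptions rather than the paper's datasets.

```python
# Per-label balance table for simple surface features; toy statements and columns only.
import pandas as pd

df = pd.DataFrame({
    "statement": [
        "The city of Paris is in France.",
        "The city of Rome is in Spain.",
        "Seventy-one is larger than thirty.",
        "Two is larger than ninety.",
    ],
    "label": [True, False, True, False],
})
df["n_chars"] = df["statement"].str.len()
df["n_words"] = df["statement"].str.split().str.len()

balance = df.groupby("label")[["n_chars", "n_words"]].agg(["mean", "std"])
print(balance)  # sentiment or lexical-frequency scores could be added as further columns
```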
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which help clarify potential confounds in our evidence for linear truth representations. We address each major point below with specific revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Datasets] Datasets section: The manuscript does not report any explicit balancing, matching, or regression controls for surface features (statement length, lexical frequency, syntactic complexity, or sentiment polarity) between true and false examples. Since visualizations, cross-dataset transfer, and causal interventions all rely on the same family of datasets, systematic differences in these features could produce the observed linear direction without it encoding truth per se.
Authors: We acknowledge that explicit balancing statistics and regression controls for surface features were not reported in the submitted manuscript. The datasets consist of simple factual statements drawn from prior high-quality sources, with true/false pairs differing primarily in factual content. In the revision, we will add a dedicated analysis subsection (with accompanying table and figures) reporting mean/variance statistics for statement length, lexical frequency, syntactic complexity, and sentiment polarity across true and false classes, along with regression-based controls demonstrating that the linear direction and visualizations persist after removing these features. This directly addresses the concern. revision: yes
-
Referee: [Causal intervention experiments] Causal intervention experiments (likely §5): While editing the direction changes model outputs to treat false statements as true (and vice versa), this shows only that the direction is used in generation; it does not isolate whether its semantic content is truth rather than a correlated surface statistic. An explicit test (e.g., intervening after regressing out length/sentiment) would be needed to support the stronger claim.
Authors: The interventions establish that the direction is causally used by the model during generation on these tasks. To more rigorously isolate semantic content from surface statistics, we will add controlled intervention experiments in the revision: we regress out length, sentiment, and related features from activations prior to direction identification and intervention, then report that the truth-flipping effect remains statistically significant. These results will be presented alongside the original interventions. revision: yes
-
Referee: [Transfer results] Transfer results (likely §4.2): Generalization across datasets is reported, but without accompanying statistics confirming that the datasets differ substantially in surface features while sharing only the truth label, the transfer could still be driven by shared confounds rather than a shared truth direction.
Authors: We will expand the transfer section in the revision to include explicit comparative statistics (e.g., pairwise distances or ANOVA results) on surface features across the datasets, confirming substantial differences in length, lexical properties, and syntax while they share only the truth label. We will also report transfer performance after partialling out these features (a minimal sketch of the residualization step follows these responses), showing that generalization is driven by the shared truth direction. revision: yes
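As a rough illustration of the residualization step promised in the responses, the sketch below regresses assumed surface-feature columns out of the activations before any direction is re-fit. Shapes, feature choices, and names are placeholders, not the authors' exact procedure.

```python
# Residualize surface features out of activations before re-fitting the probe direction.
import numpy as np

def regress_out(acts, features):
    """Remove the least-squares fit of surface features (n, k) from activations (n, d)."""
    X = np.column_stack([np.ones(len(features)), features])  # add an intercept column
    coefs, *_ = np.linalg.lstsq(X, acts, rcond=None)          # shape (k + 1, d)
    return acts - X @ coefs

# Toy usage: if true/false separation survives this step, the direction is less likely
# to be a proxy for statement length or sentiment.
rng = np.random.default_rng(0)
acts = rng.normal(size=(100, 32))
features = rng.normal(size=(100, 2))  # e.g. standardized length and sentiment scores
acts_resid = regress_out(acts, features)
print(acts_resid.shape)
```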
Circularity Check
No significant circularity; empirical evidence chain is self-contained
full rationale
The paper presents three lines of evidence (visualizations of activations, probe transfer across datasets, and causal interventions on identified directions) rather than a mathematical derivation. Probes are fit to labels, but transfer performance on distinct datasets and the causal editing results (flipping model outputs by adding/subtracting the direction) constitute independent tests that do not reduce to the fitting step by construction. No self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked as load-bearing steps. The difference-in-mean probe is compared to other methods and shown to be more causally effective, but this is an empirical comparison, not a tautological renaming. The analysis remains within standard supervised probing plus intervention methodology without the forbidden patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: linear probes on LLM activations can recover semantic distinctions such as truth value.
Forward citations
Cited by 39 Pith papers
-
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
-
Geometric Factual Recall in Transformers
A single-layer transformer memorizes random subject-attribute bijections using logarithmic embedding dimension via linear superpositions in embeddings and ReLU-gated selection in the MLP, with zero-shot transfer to ne...
-
Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens
Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
-
Deep Minds and Shallow Probes
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
-
Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs
LLMs encode repeated token counts correctly in residual streams but a format-triggered MLP at 88-93% depth overwrites it with an incorrect fixed value.
-
Steer Like the LLM: Activation Steering that Mimics Prompting
PSR models that estimate token-specific steering coefficients from activations outperform standard activation steering and compare favorably to prompting on steering benchmarks.
-
Cell-Based Representation of Relational Binding in Language Models
Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...
-
Emotion Concepts and their Function in a Large Language Model
Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
-
Detecting Multi-Agent Collusion Through Multi-Agent Interpretability
NARCBench and five activation-probing methods detect multi-agent collusion with 0.73-1.00 AUROC across distribution shifts and steganographic tasks by aggregating per-agent signals.
-
The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior
The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a stro...
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.
-
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
-
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
-
Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance
SVGT adds independent value modules and Bridge Tokens to LLMs to maintain consistent value guidance, cutting harmful outputs by over 70% in tests while preserving fluency.
-
The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations
Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.
-
Architecture, Not Scale: Circuit Localization in Large Language Models
Grouped query attention produces more concentrated and stable circuits than multi-head attention across tasks and scales in Pythia and Qwen2.5 models, with a phase transition in factual recall circuits.
-
Tool Calling is Linearly Readable and Steerable in Language Models
Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.
-
Emergent Symbolic Structure in Health Foundation Models: Extraction, Alignment, and Cross-Modal Transfer
Health foundation model embeddings contain an interpretable symbolic organization shared across modalities that supports cross-domain transfer without joint training.
-
Hallucination Detection via Activations of Open-Weight Proxy Analyzers
A framework using activation-based features from small open-weight proxy models detects LLM hallucinations with higher AUC than ReDeEP on RAGTruth, performing consistently across seven analyzer architectures.
-
Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits
Probabilistic circuits detect LLM hallucinations as residual-stream anomalies with up to 99% AUROC and enable dynamic correction that raises truthfulness scores while cutting unnecessary output corruption.
-
Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance
Probe-geometry alignment erases cross-sequence memorization signatures in LLMs below chance using per-depth rank-one activation interventions with negligible impact on zero-shot capabilities.
-
Compared to What? Baselines and Metrics for Counterfactual Prompting
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
-
Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while pr...
-
Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out har...
-
Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
Harmful intent is geometrically recoverable as a linear direction or angular deviation in LLM residual streams, with high AUROC across 12 models, stable under alignment variants including abliterated ones, and transfe...
-
LLM Safety From Within: Detecting Harmful Content with Internal Representations
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
-
Testing the Limits of Truth Directions in LLMs
Truth directions in LLMs are not universal but depend heavily on model layer, task type and difficulty, and prompt instructions.
-
Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs
Hallucination neurons in LLMs are domain-specific, with cross-domain classifiers dropping from AUROC 0.783 within-domain to 0.563 across domains.
-
HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory
HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.
-
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
-
Negative Before Positive: Asymmetric Valence Processing in Large Language Models
Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.
-
Exploring Concreteness Through a Figurative Lens
LLMs compress concreteness into a consistent 1D direction in mid-to-late layers that separates literal from figurative noun uses and supports efficient classification plus steering.
-
Learning Uncertainty from Sequential Internal Dispersion in Large Language Models
SIVR detects LLM hallucinations by learning from token-wise and layer-wise variance patterns in internal hidden states, outperforming baselines with better generalization and less training data.
-
H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models
H-probes locate low-dimensional subspaces encoding hierarchy in LLM activations for synthetic tree tasks, show causal importance and generalization, and detect weaker signals in mathematical reasoning traces.
-
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
-
From Actions to Understanding: Conformal Interpretability of Temporal Concepts in LLM Agents
A conformal interpretability method labels LLM agent states step-by-step and extracts linearly separable temporal concept directions aligned with task success on ScienceWorld and AlfWorld.
-
Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers
Human visual interestingness is linearly decodable from final-layer embeddings in Qwen3-VL-8B and becomes progressively more structured across vision and language layers without explicit supervision.
-
Risk Reporting for Developers' Internal AI Model Use
A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.