Pith · machine review for the scientific record

arxiv: 2310.06824 · v3 · submitted 2023-10-10 · 💻 cs.AI

Recognition: no theorem link

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Samuel Marks, Max Tegmark

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 19:29 UTC · model grok-4.3

classification 💻 cs.AI
keywords: large language models · truth representation · linear probes · model activations · causal interventions · factual statements · hallucination detection

The pith

Large language models encode the truth or falsehood of factual statements as a linear direction in their activation space at sufficient scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the internal activations of large language models as they process simple true and false statements from high-quality datasets. Visualizations of these activations show a clear linear separation between true and false examples. Probes trained on one dataset transfer successfully to others, and editing activations along the identified direction makes the model treat false statements as true and vice versa. This matters for AI reliability because the structure points to a concrete mechanism for detecting, and potentially correcting, inaccurate model outputs.

Core claim

At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements. Visualizations reveal clear linear structure in the representations. Probes trained on one dataset generalize to different datasets. Causal interventions along the linear direction in a model's forward pass cause it to treat false statements as true and true statements as false. Simple difference-in-mean probes identify directions that are more causally implicated in model outputs than other probing techniques.

What carries the argument

The truth direction: a vector in activation space separating true from false statements, which linear probes use for classification and which supports direct causal interventions that flip the model's truth judgments.
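The difference-in-mean probe the paper favors can be sketched in a few lines. The data below is synthetic (a planted direction plus noise stands in for real LLM activations), so every name and number here is illustrative, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for LLM activations: a hidden "truth direction"
# separates true from false statements, plus isotropic noise.
d = 64
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)
acts_true = rng.normal(size=(200, d)) + 2.0 * truth_dir
acts_false = rng.normal(size=(200, d)) - 2.0 * truth_dir

# Difference-in-mean probe: the direction is simply the gap between class means.
direction = acts_true.mean(axis=0) - acts_false.mean(axis=0)
direction /= np.linalg.norm(direction)

# Classify by projecting onto the direction and thresholding at the midpoint.
midpoint = 0.5 * ((acts_true @ direction).mean() + (acts_false @ direction).mean())

def predict_true(acts):
    return (acts @ direction) > midpoint

accuracy = 0.5 * (predict_true(acts_true).mean() + (~predict_true(acts_false)).mean())
print(f"probe accuracy: {accuracy:.2f}")
```

The appeal of this probe is that it has no fit hyperparameters: the direction is determined entirely by the two class means, which is part of why the paper can ask whether it is more causally implicated in outputs than learned probes.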

If this is right

  • Probes trained on one true/false dataset generalize to others without additional training.
  • Intervening along the linear direction during the forward pass directly alters whether the model outputs true or false responses.
  • Basic difference-in-mean probes match the performance of more complex probing methods while being more causally relevant to the outputs.
  • The linear representation of truth becomes apparent only once models reach larger scales.
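The intervention in the second bullet amounts to adding a scaled copy of the direction to a hidden state mid-forward-pass. A toy version on a single synthetic activation (the direction, scale, and activation are all illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64

# A candidate truth direction (in the paper this comes from a probe).
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)

# Synthetic activation for a false statement, constructed so that its
# projection onto the truth direction is exactly -2.
h = rng.normal(size=d)
h_false = h - (h @ truth_dir) * truth_dir - 2.0 * truth_dir

def intervene(h, direction, alpha):
    # Shift the hidden state along the candidate direction.
    return h + alpha * direction

h_edited = intervene(h_false, truth_dir, alpha=4.0)

# The sign of the projection flips (-2.0 -> +2.0), so downstream layers that
# read this direction now treat the false statement as true.
print(h_false @ truth_dir, h_edited @ truth_dir)
```

In a real model the shift would be applied to residual-stream activations at a chosen layer during generation; the toy version only shows why a single vector addition can flip the readout.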

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the linear direction is stable, it could be amplified during generation to reduce false outputs in deployed models.
  • Similar linear encodings might exist for other factual or logical distinctions beyond simple true/false.
  • This structure raises the possibility of using activation edits as a post-training tool for truthfulness without full retraining.

Load-bearing premise

The true/false labels in the chosen datasets reflect genuine truth distinctions rather than being confounded by unrelated surface features such as statement length, topic, or sentiment.

What would settle it

A new collection of true/false statements on previously unseen topics on which a probe trained on the original datasets performs no better than chance, or on which activation interventions along the candidate direction leave the model's output probabilities unchanged.
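The transfer half of that test can be mocked up directly: train a difference-in-mean probe on one synthetic "topic" and evaluate it on another that shares only the planted truth direction. All datasets and scales here are invented for illustration; chance-level accuracy on the held-out topic is the outcome that would undercut the paper's claim:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64
shared_dir = rng.normal(size=d)
shared_dir /= np.linalg.norm(shared_dir)

def make_dataset(n, topic_scale):
    # Each synthetic topic has its own mean offset; only the truth
    # direction is shared across datasets.
    offset = rng.normal(size=d) * topic_scale
    acts_t = rng.normal(size=(n, d)) + 2.0 * shared_dir + offset
    acts_f = rng.normal(size=(n, d)) - 2.0 * shared_dir + offset
    return acts_t, acts_f

train_t, train_f = make_dataset(200, topic_scale=1.0)
test_t, test_f = make_dataset(200, topic_scale=3.0)   # "unseen topic"

direction = train_t.mean(0) - train_f.mean(0)
direction /= np.linalg.norm(direction)

def transfer_accuracy(acts_t, acts_f):
    # Center with the pooled, label-free mean so the topic offset cancels.
    pooled = np.vstack([acts_t, acts_f]).mean(0)
    return 0.5 * (((acts_t - pooled) @ direction > 0).mean()
                  + ((acts_f - pooled) @ direction < 0).mean())

acc = transfer_accuracy(test_t, test_f)
print(f"transfer accuracy: {acc:.2f}")  # near 0.5 would undercut the claim
```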

Original abstract

Large Language Models (LLMs) have impressive capabilities, but are prone to outputting falsehoods. Recent work has developed techniques for inferring whether a LLM is telling the truth by training probes on the LLM's internal activations. However, this line of work is controversial, with some authors pointing out failures of these probes to generalize in basic ways, among other conceptual issues. In this work, we use high-quality datasets of simple true/false statements to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure. 2. Transfer experiments in which probes trained on one dataset generalize to different datasets. 3. Causal evidence obtained by surgically intervening in a LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements. We also show that simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that large language models at sufficient scale linearly represent the truth or falsehood of factual statements. It supports this with three lines of evidence from high-quality true/false statement datasets: (1) visualizations of internal activations showing clear linear structure separating true and false items, (2) transfer experiments where probes trained on one dataset generalize to others, and (3) causal interventions that edit the identified direction in the forward pass to flip the model's treatment of statements as true or false. It further argues that simple difference-in-mean probes match more complex methods in performance while yielding more causally relevant directions.

Significance. If the central claim holds after addressing potential confounds, the work would strengthen evidence for emergent linear representations of abstract concepts like truth in LLMs, with direct implications for mechanistic interpretability and hallucination mitigation. The demonstration that difference-in-mean probes are competitive yet more causally implicated is a practical contribution, and the combination of visualization, transfer, and intervention provides converging evidence that goes beyond correlational probing.

major comments (3)
  1. [Datasets] Datasets section: The manuscript does not report any explicit balancing, matching, or regression controls for surface features (statement length, lexical frequency, syntactic complexity, or sentiment polarity) between true and false examples. Since visualizations, cross-dataset transfer, and causal interventions all rely on the same family of datasets, systematic differences in these features could produce the observed linear direction without it encoding truth per se.
  2. [Causal intervention experiments] Causal intervention experiments (likely §5): While editing the direction changes model outputs to treat false statements as true (and vice versa), this shows only that the direction is used in generation; it does not isolate whether its semantic content is truth rather than a correlated surface statistic. An explicit test (e.g., intervening after regressing out length/sentiment) would be needed to support the stronger claim.
  3. [Transfer results] Transfer results (likely §4.2): Generalization across datasets is reported, but without accompanying statistics confirming that the datasets differ substantially in surface features while sharing only the truth label, the transfer could still be driven by shared confounds rather than a shared truth direction.
minor comments (2)
  1. [Figures] Figure captions and axis labels in the visualization panels could be expanded to include the exact models and layers used, improving reproducibility.
  2. [Datasets] The paper would benefit from a short table summarizing dataset statistics (e.g., mean length, sentiment scores) for true vs. false splits to allow readers to assess balance.
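The control the referee asks for in the major comments amounts to partialling surface features out of the activations before any direction is identified. A minimal sketch with ordinary least squares, on synthetic, deliberately confounded data (the feature choices are hypothetical examples, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 400, 64

# Hypothetical surface features per statement (e.g., length, sentiment score);
# the activations are synthetic and deliberately confounded by them.
features = rng.normal(size=(n, 2))
acts = rng.normal(size=(n, d)) + features @ rng.normal(size=(2, d))

# OLS: remove the component of each activation dimension that is linearly
# predictable from the surface features, keeping only the residual.
X = np.column_stack([np.ones(n), features])        # intercept + features
beta, *_ = np.linalg.lstsq(X, acts, rcond=None)
residual_acts = acts - X @ beta

# By construction the residuals are orthogonal to the features, so any linear
# direction found in them cannot be driven by these surface statistics.
print(np.abs(features.T @ residual_acts).max())
```

A direction that survives this regression (as the rebuttal promises to show) is much harder to explain away as length or sentiment; one that vanishes would support the confound interpretation.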

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify potential confounds in our evidence for linear truth representations. We address each major point below with specific revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Datasets] Datasets section: The manuscript does not report any explicit balancing, matching, or regression controls for surface features (statement length, lexical frequency, syntactic complexity, or sentiment polarity) between true and false examples. Since visualizations, cross-dataset transfer, and causal interventions all rely on the same family of datasets, systematic differences in these features could produce the observed linear direction without it encoding truth per se.

    Authors: We acknowledge that explicit balancing statistics and regression controls for surface features were not reported in the submitted manuscript. The datasets consist of simple factual statements drawn from prior high-quality sources, with true/false pairs differing primarily in factual content. In the revision, we will add a dedicated analysis subsection (with accompanying table and figures) reporting mean/variance statistics for statement length, lexical frequency, syntactic complexity, and sentiment polarity across true and false classes, along with regression-based controls demonstrating that the linear direction and visualizations persist after removing these features. This directly addresses the concern. revision: yes

  2. Referee: [Causal intervention experiments] Causal intervention experiments (likely §5): While editing the direction changes model outputs to treat false statements as true (and vice versa), this shows only that the direction is used in generation; it does not isolate whether its semantic content is truth rather than a correlated surface statistic. An explicit test (e.g., intervening after regressing out length/sentiment) would be needed to support the stronger claim.

    Authors: The interventions establish that the direction is causally used by the model during generation on these tasks. To more rigorously isolate semantic content from surface statistics, we will add controlled intervention experiments in the revision: we regress out length, sentiment, and related features from activations prior to direction identification and intervention, then report that the truth-flipping effect remains statistically significant. These results will be presented alongside the original interventions. revision: yes

  3. Referee: [Transfer results] Transfer results (likely §4.2): Generalization across datasets is reported, but without accompanying statistics confirming that the datasets differ substantially in surface features while sharing only the truth label, the transfer could still be driven by shared confounds rather than a shared truth direction.

    Authors: We will expand the transfer section in the revision to include explicit comparative statistics (e.g., pairwise distances or ANOVA results) on surface features across the datasets, confirming substantial differences in length, lexical properties, and syntax while they share only the truth label. We will also report transfer performance after partialling out these features, showing that generalization is driven by the shared truth direction. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evidence chain is self-contained

Full rationale

The paper presents three lines of evidence (visualizations of activations, probe transfer across datasets, and causal interventions on identified directions) rather than a mathematical derivation. Probes are fit to labels, but transfer performance on distinct datasets and the causal editing results (flipping model outputs by adding/subtracting the direction) constitute independent tests that do not reduce to the fitting step by construction. No self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked as load-bearing steps. The difference-in-mean probe is compared to other methods and shown to be more causally effective, but this is an empirical comparison, not a tautological renaming. The analysis remains within standard supervised probing plus intervention methodology without the forbidden patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the domain assumption that linear directions in activation space can capture semantic properties such as truth, plus the practical assumption that the curated true/false datasets isolate truth from other correlated features.

axioms (1)
  • domain assumption: Linear probes on LLM activations can recover semantic distinctions such as truth value.
    Invoked by the decision to train and evaluate linear classifiers on internal representations.

pith-pipeline@v0.9.0 · 5493 in / 1097 out tokens · 43912 ms · 2026-05-12T19:29:26.086618+00:00 · methodology

discussion (0)


Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  2. Geometric Factual Recall in Transformers

    cs.CL 2026-05 conditional novelty 8.0

    A single-layer transformer memorizes random subject-attribute bijections using logarithmic embedding dimension via linear superpositions in embeddings and ReLU-gated selection in the MLP, with zero-shot transfer to ne...

  3. Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

    cs.LG 2026-04 accept novelty 8.0

    Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

  4. Deep Minds and Shallow Probes

    cs.LG 2026-05 unverdicted novelty 7.0

    Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

  5. Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs

    cs.CL 2026-05 unverdicted novelty 7.0

    LLMs encode repeated token counts correctly in residual streams but a format-triggered MLP at 88-93% depth overwrites it with an incorrect fixed value.

  6. Steer Like the LLM: Activation Steering that Mimics Prompting

    cs.CL 2026-05 unverdicted novelty 7.0

    PSR models that estimate token-specific steering coefficients from activations outperform standard activation steering and compare favorably to prompting on steering benchmarks.

  7. Cell-Based Representation of Relational Binding in Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...

  8. Emotion Concepts and their Function in a Large Language Model

    cs.AI 2026-04 unverdicted novelty 7.0

    Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

  9. Detecting Multi-Agent Collusion Through Multi-Agent Interpretability

    cs.AI 2026-04 conditional novelty 7.0

    NARCBench and five activation-probing methods detect multi-agent collusion with 0.73-1.00 AUROC across distribution shifts and steganographic tasks by aggregating per-agent signals.

  10. The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

    cs.LG 2026-03 unverdicted novelty 7.0

    The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a stro...

  11. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  12. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

  13. When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

    cs.AI 2026-05 unverdicted novelty 6.0

    Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.

  14. Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...

  15. Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance

    cs.AI 2026-05 unverdicted novelty 6.0

    SVGT adds independent value modules and Bridge Tokens to LLMs to maintain consistent value guidance, cutting harmful outputs by over 70% in tests while preserving fluency.

  16. The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.

  17. Architecture, Not Scale: Circuit Localization in Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Grouped query attention produces more concentrated and stable circuits than multi-head attention across tasks and scales in Pythia and Qwen2.5 models, with a phase transition in factual recall circuits.

  18. Tool Calling is Linearly Readable and Steerable in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

  19. Emergent Symbolic Structure in Health Foundation Models: Extraction, Alignment, and Cross-Modal Transfer

    cs.LG 2026-05 unverdicted novelty 6.0

    Health foundation model embeddings contain an interpretable symbolic organization shared across modalities that supports cross-domain transfer without joint training.

  20. Hallucination Detection via Activations of Open-Weight Proxy Analyzers

    cs.CL 2026-05 unverdicted novelty 6.0

    A framework using activation-based features from small open-weight proxy models detects LLM hallucinations with higher AUC than ReDeEP on RAGTruth, performing consistently across seven analyzer architectures.

  21. Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits

    cs.CL 2026-05 unverdicted novelty 6.0

    Probabilistic circuits detect LLM hallucinations as residual-stream anomalies with up to 99% AUROC and enable dynamic correction that raises truthfulness scores while cutting unnecessary output corruption.

  22. Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance

    cs.LG 2026-05 unverdicted novelty 6.0

    Probe-geometry alignment erases cross-sequence memorization signatures in LLMs below chance using per-depth rank-one activation interventions with negligible impact on zero-shot capabilities.

  23. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  24. Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while pr...

  25. Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

    cs.LG 2026-04 unverdicted novelty 6.0

    Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out har...

  26. Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

    cs.LG 2026-04 unverdicted novelty 6.0

    Harmful intent is geometrically recoverable as a linear direction or angular deviation in LLM residual streams, with high AUROC across 12 models, stable under alignment variants including abliterated ones, and transfe...

  27. LLM Safety From Within: Detecting Harmful Content with Internal Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

  28. Testing the Limits of Truth Directions in LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Truth directions in LLMs are not universal but depend heavily on model layer, task type and difficulty, and prompt instructions.

  29. Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs

    cs.CL 2026-03 unverdicted novelty 6.0

    Hallucination neurons in LLMs are domain-specific, with cross-domain classifiers dropping from AUROC 0.783 within-domain to 0.563 across domains.

  30. HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory

    cs.AI 2026-05 unverdicted novelty 5.0

    HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.

  31. Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

    cs.AI 2026-05 unverdicted novelty 5.0

    Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

  32. Negative Before Positive: Asymmetric Valence Processing in Large Language Models

    cs.CL 2026-05 unverdicted novelty 5.0

    Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.

  33. Exploring Concreteness Through a Figurative Lens

    cs.CL 2026-04 unverdicted novelty 5.0

    LLMs compress concreteness into a consistent 1D direction in mid-to-late layers that separates literal from figurative noun uses and supports efficient classification plus steering.

  34. Learning Uncertainty from Sequential Internal Dispersion in Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    SIVR detects LLM hallucinations by learning from token-wise and layer-wise variance patterns in internal hidden states, outperforming baselines with better generalization and less training data.

  35. H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    H-probes locate low-dimensional subspaces encoding hierarchy in LLM activations for synthetic tree tasks, show causal importance and generalization, and detect weaker signals in mathematical reasoning traces.

  36. Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...

  37. From Actions to Understanding: Conformal Interpretability of Temporal Concepts in LLM Agents

    cs.AI 2026-03 unverdicted novelty 5.0

    A conformal interpretability method labels LLM agent states step-by-step and extracts linearly separable temporal concept directions aligned with task success on ScienceWorld and AlfWorld.

  38. Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers

    cs.CV 2026-05 unverdicted novelty 4.0

    Human visual interestingness is linearly decodable from final-layer embeddings in Qwen3-VL-8B and becomes progressively more structured across vision and language layers without explicit supervision.

  39. Risk Reporting for Developers' Internal AI Model Use

    cs.CY 2026-04 unverdicted novelty 4.0

    A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · cited by 38 Pith papers · 1 internal anchor

  1. [1]

    Scaling Learning Algorithms Towards

    Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

  2. [2]

    and Osindero, Simon and Teh, Yee Whye , journal =

    Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

  3. [3]

    2016 , publisher=

    Deep learning , author=. 2016 , publisher=

  4. [4]

    2023 , eprint=

    The Internal State of an LLM Knows When its Lying , author=. 2023 , eprint=

  5. [5]

    2023 , eprint=

    Finding Neurons in a Haystack: Case Studies with Sparse Probing , author=. 2023 , eprint=

  6. [6]

    The Eleventh International Conference on Learning Representations , year=

    Discovering Latent Knowledge in Language Models Without Supervision , author=. The Eleventh International Conference on Learning Representations , year=

  7. [7]

    2023 , eprint=

    Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , author=. 2023 , eprint=

  8. [8]

    2023 , eprint=

    Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks , author=. 2023 , eprint=

  9. [9]

    2023 , url=

    What Discovering Latent Knowledge Did and Did Not Find , author=. 2023 , url=

  10. [11]

    2023 , eprint=

    LLaMA: Open and Efficient Foundation Language Models , author=. 2023 , eprint=

  11. [12]

    The Journal of Machine Learning Research , volume=

    The implicit bias of gradient descent on separable data , author=. The Journal of Machine Learning Research , volume=. 2018 , publisher=

  12. [13]

    2023 , eprint=

    Linearity of Relation Decoding in Transformer Language Models , author=. 2023 , eprint=

  13. [15]

    2022 , eprint=

    Discovering Language Model Behaviors with Model-Written Evaluations , author=. 2022 , eprint=

  14. [16]

    2021 , url=

    Eliciting latent knowledge: How to tell if your eyes deceive you , author=. 2021 , url=

  15. [17]

    2023 , eprint=

    GPT-4 Technical Report , author=. 2023 , eprint=

  16. [18]

    2021 , eprint=

    Can Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color , author=. 2021 , eprint=

  17. [19]

    International Conference on Learning Representations , year=

    Mapping Language Models to Grounded Conceptual Spaces , author=. International Conference on Learning Representations , year=

  18. [20]

    The Eleventh International Conference on Learning Representations , year=

    Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task , author=. The Eleventh International Conference on Learning Representations , year=

  19. [21]

    2020 , doi =

    Bau, David and Zhu, Jun-Yan and Strobelt, Hendrik and Lapedriza, Agata and Zhou, Bolei and Torralba, Antonio , title =. 2020 , doi =

  20. [24]

    Analyzing Individual Neurons in Pre-trained Language Models

    Durrani, Nadir and Sajjad, Hassan and Dalvi, Fahim and Belinkov, Yonatan. Analyzing Individual Neurons in Pre-trained Language Models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.395

  21. [25]

    AAAI Conference on Artificial Intelligence , year=

    What Is One Grain of Sand in the Desert? Analyzing Individual Neurons in Deep NLP Models , author=. AAAI Conference on Artificial Intelligence , year=

  22. [26]

    Distill , year =

    Olah, Chris and Cammarata, Nick and Schubert, Ludwig and Goh, Gabriel and Petrov, Michael and Carter, Shan , title =. Distill , year =

  23. [27]

    2022 , journal=

    Toy Models of Superposition , author=. 2022 , journal=

  24. [28]

    2023 , eprint=

    Sparse Autoencoders Find Highly Interpretable Features in Language Models , author=. 2023 , eprint=

  25. [29]

    Distill , year =

    Goh, Gabriel and †, Nick Cammarata and †, Chelsea Voss and Carter, Shan and Petrov, Michael and Schubert, Ludwig and Radford, Alec and Olah, Chris , title =. Distill , year =

  26. [30]

    Designing and Interpreting Probes with Control Tasks

    Hewitt, John and Liang, Percy. Designing and Interpreting Probes with Control Tasks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1275

  27. [31]

    Probing Classifiers: Promises, Shortcomings, and Advances

    Belinkov, Yonatan. Probing Classifiers: Promises, Shortcomings, and Advances. Computational Linguistics. 2022. doi:10.1162/coli_a_00422

  28. [32]

    Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence , pages =

    Pearl, Judea , title =. Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence , pages =. 2001 , isbn =

  29. [33]

    Investigating Gender Bias in Language Models Using Causal Mediation Analysis , url =

    Vig, Jesse and Gehrmann, Sebastian and Belinkov, Yonatan and Qian, Sharon and Nevo, Daniel and Singer, Yaron and Shieber, Stuart , booktitle =. Investigating Gender Bias in Language Models Using Causal Mediation Analysis , url =

  30. [34]

    Locating and Editing Factual Associations in

    Kevin Meng and David Bau and Alex Andonian and Yonatan Belinkov , journal=. Locating and Editing Factual Associations in

  31. [35]

    2023 , eprint=

    Explore, Establish, Exploit: Red Teaming Language Models from Scratch , author=. 2023 , eprint=

  32. [36]

    All Cities with a population

    Geonames , year =. All Cities with a population

  33. [37]

    2023 , url=

    Emergent Deception and Emergent Optimization , author=. 2023 , url=

  34. [38]

    2023 , eprint=

    AI Deception: A Survey of Examples, Risks, and Potential Solutions , author=. 2023 , eprint=

  35. [39]

    Understanding intermediate layers using linear classifier probes

    Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. 2018.

  36. [40]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023.

  37. [41]

    GPT-NeoX-20B: An Open-Source Autoregressive Language Model

    Black, Sidney and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, Usvsn Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel. GPT-NeoX-20B: An Open-Source Autoregressive Language Model.

  38. [42]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang et al. OPT: Open Pre-trained Transformer Language Models. 2022.

  39. [43]

    Localizing Model Behavior with Path Patching

    Nicholas Goldowsky-Dill et al. Localizing Model Behavior with Path Patching. 2023.

  40. [44]

    Linear Representations of Sentiment in Large Language Models

    Curt Tigges et al. Linear Representations of Sentiment in Large Language Models. 2023.

  41. [45]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete Problems in AI Safety. 2016.

  42. [46]

    Measuring Progress on Scalable Oversight for Large Language Models

    Samuel R. Bowman et al. Measuring Progress on Scalable Oversight for Large Language Models. 2022.

  43. [47]

    Deep reinforcement learning from human preferences

    Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. 2023.

  44. [48]

    Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

    Stephen Casper et al. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. 2023.

  45. [49]

    Collaborative data science

    Plotly Technologies Inc. Collaborative data science. 2015.

  46. [52]

    Dissecting Recall of Factual Associations in Auto-Regressive Language Models

    Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting Recall of Factual Associations in Auto-Regressive Language Models. 2023.

  47. [53]

    Causal Abstractions of Neural Networks

    Atticus Geiger and Hanson Lu and Thomas Icard and Christopher Potts. Causal Abstractions of Neural Networks. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021). 2021.

  48. [54]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Andy Zou et al. Representation Engineering: A Top-Down Approach to AI Transparency. 2023.

  49. [56]

    Steering Llama 2 via Contrastive Activation Addition

    Nina Rimsky et al. Steering Llama 2 via Contrastive Activation Addition. 2024.

  50. [57]

    CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge

    Yasumasa Onoe, Michael JQ Zhang, Eunsol Choi, and Greg Durrett. CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=mbW_GT3ZN-

  51. [58]

    LEACE: Perfect Linear Concept Erasure in Closed Form

    Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Biderman. LEACE: Perfect Linear Concept Erasure in Closed Form. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=awIpKpwTwF

  52. [59]

    Can language models encode perceptual structure without grounding? a case study in color, 2021

    Mostafa Abdou, Artur Kulmizev, Daniel Hershcovich, Stella Frank, Ellie Pavlick, and Anders Søgaard. Can language models encode perceptual structure without grounding? a case study in color, 2021

  53. [60]

    Understanding intermediate layers using linear classifier probes, 2018

    Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes, 2018

  54. [61]

    The internal state of an LLM knows when it's lying, 2023

    Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it's lying, 2023

  55. [62]

    Understanding the role of individual units in a deep neural network

    David Bau, Jun-Yan Zhu, Hendrik Strobelt, Agata Lapedriza, Bolei Zhou, and Antonio Torralba. Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences, 2020. ISSN 0027-8424. doi:10.1073/pnas.1907375117. URL https://www.pnas.org/content/early/2020/08/31/1907375117

  56. [63]

    LEACE: Perfect linear concept erasure in closed form

    Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Biderman. LEACE: Perfect linear concept erasure in closed form. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=awIpKpwTwF

  57. [64]

    Discovering latent knowledge in language models without supervision

    Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=ETKGuby0hcs

  58. [65]

    Explore, establish, exploit: Red teaming language models from scratch, 2023

    Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, and Dylan Hadfield-Menell. Explore, establish, exploit: Red teaming language models from scratch, 2023

  59. [66]

    Eliciting latent knowledge: How to tell if your eyes deceive you, 2021

    Paul Christiano, Ajeya Cotra, and Mark Xu. Eliciting latent knowledge: How to tell if your eyes deceive you, 2021. URL https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.jrzi4atzacns

  60. [67]

    Sparse autoencoders find highly interpretable features in language models, 2023

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models, 2023

  61. [68]

    What is one grain of sand in the desert? Analyzing individual neurons in deep NLP models

    Fahim Dalvi, Nadir Durrani, Hassan Sajjad, Yonatan Belinkov, Anthony Bau, and James R. Glass. What is one grain of sand in the desert? Analyzing individual neurons in deep NLP models. In AAAI Conference on Artificial Intelligence, 2018. URL https://api.semanticscholar.org/CorpusID:56895415

  62. [69]

    Toy models of superposition

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022

  63. [70]

    Causal analysis of syntactic agreement mechanisms in neural language models

    Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart Shieber, Tal Linzen, and Yonatan Belinkov. Causal analysis of syntactic agreement mechanisms in neural language models. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021.

  64. [71]

    The use of multiple measurements in taxonomic problems

    R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2): 179--188, 1936. doi:10.1111/j.1469-1809.1936.tb02137.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1469-1809.1936.tb02137.x

  65. [72]

    Neural natural language inference models partially embed theories of lexical entailment and negation

    Atticus Geiger, Kyle Richardson, and Christopher Potts. Neural natural language inference models partially embed theories of lexical entailment and negation. In Afra Alishahi, Yonatan Belinkov, Grzegorz Chrupala, Dieuwke Hupkes, Yuval Pinter, and Hassan Sajjad (eds.), Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020.

  66. [73]

    Causal abstractions of neural networks

    Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021 (NeurIPS 2021), 2021.

  67. [74]

    All cities with a population > 1000, 2023

    Geonames. All cities with a population > 1000, 2023. URL https://download.geonames.org/export/dump/

  68. [75]

    Dissecting recall of factual associations in auto-regressive language models, 2023

    Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models, 2023

  69. [76]

    Multimodal neurons in artificial neural networks

    Gabriel Goh, Nick Cammarata †, Chelsea Voss †, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal neurons in artificial neural networks. Distill, 2021. doi:10.23915/distill.00030. https://distill.pub/2021/multimodal-neurons

  70. [77]

    Finding neurons in a haystack: Case studies with sparse probing, 2023

    Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing, 2023

  71. [78]

    Still no lie detector for language models: Probing empirical and conceptual roadblocks

    B. A. Levinstein and Daniel A. Herrmann. Still no lie detector for language models: Probing empirical and conceptual roadblocks, 2023

  72. [79]

    Implicit representations of meaning in neural language models

    Belinda Z. Li, Maxwell Nye, and Jacob Andreas. Implicit representations of meaning in neural language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1813--1827, Online, August 2021. Association for Computational Linguistics.

  73. [80]

    Emergent world representations: Exploring a sequence model trained on a synthetic task

    Kenneth Li, Aspen K Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. In The Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=DeG07_TcZvT

  74. [81]

    Inference-time intervention: Eliciting truthful answers from a language model, 2023b

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model, 2023b

  75. [82]

    TruthfulQA: Measuring how models mimic human falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3214--3252, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.acl-long.229

  76. [83]

    Locating and editing factual associations in GPT

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 36, 2022

  77. [84]

    CREAK: A dataset for commonsense reasoning over entity knowledge

    Yasumasa Onoe, Michael JQ Zhang, Eunsol Choi, and Greg Durrett. CREAK: A dataset for commonsense reasoning over entity knowledge. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=mbW_GT3ZN-

  78. [85]

    GPT-4 technical report, 2023

    OpenAI. GPT-4 technical report, 2023

  79. [86]

    AI deception: A survey of examples, risks, and potential solutions

    Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, and Dan Hendrycks. AI deception: A survey of examples, risks, and potential solutions, 2023

  80. [87]

    Mapping language models to grounded conceptual spaces

    Roma Patel and Ellie Pavlick. Mapping language models to grounded conceptual spaces. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=gJcEM8sxHK

Showing first 80 references.