pith. machine review for the scientific record.

arxiv: 2303.08112 · v6 · submitted 2023-03-14 · 💻 cs.LG

Recognition: 3 theorem links · Lean Theorem

Eliciting Latent Predictions from Transformers with the Tuned Lens

Danny Halawi, Igor Ostrovsky, Jacob Steinhardt, Lev McKinney, Logan Smith, Nora Belrose, Stella Biderman, Zach Furman

Pith reviewed 2026-05-12 16:49 UTC · model grok-4.3

classification 💻 cs.LG
keywords tuned lens · logit lens · transformer interpretability · latent predictions · affine probes · layer-wise decoding · malicious input detection

The pith

Affine probes trained per layer decode transformer hidden states into reliable vocabulary predictions, outperforming the logit lens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains a small affine probe on the hidden state of each block in a frozen pretrained transformer, allowing every layer to be decoded into a distribution over tokens. This tuned lens refines the logit lens by learning a per-layer adjustment rather than using a fixed projection. Experiments across models up to 20B parameters show the resulting latent predictions are more accurate, stable, and less biased. Causal interventions confirm the probes rely on features the model itself uses. The sequence of these predictions across layers also serves as an effective signal for identifying malicious inputs.

Core claim

We train an affine probe for each block in a frozen pretrained model, making it possible to decode every hidden state into a distribution over the vocabulary. Our method, the tuned lens, is a refinement of the earlier logit lens technique. We show it to be more predictive, reliable and unbiased than the logit lens. With causal experiments, we show the tuned lens uses similar features to the model itself. The trajectory of latent predictions can be used to detect malicious inputs with high accuracy.

What carries the argument

The tuned lens: a learned affine probe fitted independently to each layer's hidden state that maps it to logits over the vocabulary.
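
To make the mechanism concrete, here is a minimal PyTorch sketch of a tuned-lens probe (an illustration under assumptions, not the authors' released implementation; `final_norm` and `unembed` stand in for the frozen model's own output head). The probe is an affine translator on the hidden state, initialized at zero so that training starts from the logit lens, and fitted by minimizing KL divergence to the model's final distribution.

    import torch
    import torch.nn.functional as F

    class TunedLensProbe(torch.nn.Module):
        """Affine probe for one layer: decodes a hidden state into vocabulary logits."""

        def __init__(self, d_model: int):
            super().__init__()
            # Zero-initialized translator: at initialization the probe reduces
            # to the logit lens (identity map into the frozen output head).
            self.translator = torch.nn.Linear(d_model, d_model)
            torch.nn.init.zeros_(self.translator.weight)
            torch.nn.init.zeros_(self.translator.bias)

        def forward(self, h, final_norm, unembed):
            # h: (batch, seq, d_model) hidden state at this layer; the frozen
            # model's final layer norm and unembedding produce the logits.
            return unembed(final_norm(h + self.translator(h)))

    def train_step(probe, h_layer, final_logits, final_norm, unembed, opt):
        """Fit the probe so its decoded distribution matches the final output."""
        lens_logits = probe(h_layer, final_norm, unembed)
        loss = F.kl_div(
            F.log_softmax(lens_logits, dim=-1),
            F.log_softmax(final_logits, dim=-1),
            log_target=True,
            reduction="batchmean",
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

One probe is trained per layer on the frozen model's own outputs; nothing in the base model is updated.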

If this is right

  • The method provides a stable view of how token predictions are refined layer by layer during inference.
  • Prediction trajectories across layers can flag anomalous or adversarial inputs without additional model training (a sketch of one such detector follows this list).
  • The probes generalize across autoregressive language models of varying sizes up to 20 billion parameters.
  • Causal evidence indicates the decoded distributions reflect features the model actually computes rather than spurious correlations.
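
A hedged sketch of the detector implied by the second bullet (an illustration with assumed array shapes, not the paper's exact pipeline): summarize each input by its per-layer KL divergence to the final output, then fit a standard outlier detector on trajectories from benign prompts.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    def kl_trajectory(per_layer_logprobs, final_logprobs):
        """KL(final || layer) at each depth.

        per_layer_logprobs: (n_layers, vocab) log-probs from the tuned lens.
        final_logprobs: (vocab,) log-probs from the model's output layer.
        Returns an (n_layers,) trajectory summarizing the input.
        """
        p_final = np.exp(final_logprobs)
        return (p_final * (final_logprobs - per_layer_logprobs)).sum(axis=-1)

    def fit_detector(benign_trajectories):
        """Fit an outlier detector on (n_examples, n_layers) benign trajectories."""
        return IsolationForest(random_state=0).fit(benign_trajectories)

    def anomaly_scores(detector, trajectories):
        """Lower scores indicate more anomalous (potentially malicious) inputs."""
        return detector.score_samples(trajectories)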

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique could support layer-specific model editing by identifying which layers hold the decisive prediction shifts.
  • Similar per-layer probes might apply to non-language transformers for inspecting feature refinement in other domains.
  • Tracking divergence between tuned-lens predictions and final output could serve as an online monitor for unexpected model behavior.

Load-bearing premise

The learned affine probes recover the model's actual internal computation rather than merely fitting a convenient readout that happens to correlate with the final output.

What would settle it

A causal intervention that changes the model's final output but leaves the tuned lens predictions at intermediate layers unchanged, or a direct comparison showing the tuned lens is no more predictive than the logit lens on held-out data.
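
The second test is easy to operationalize. A minimal sketch under assumed array shapes (not the paper's evaluation code): average the KL divergence to the final distribution over held-out tokens, per layer, for each lens; the superiority claim fails wherever the tuned-lens curve is not below the logit-lens curve.

    import numpy as np

    def mean_kl_to_final(lens_logprobs, final_logprobs):
        """Mean KL(final || lens) per layer over held-out tokens.

        lens_logprobs: (n_tokens, n_layers, vocab) log-probs from one lens.
        final_logprobs: (n_tokens, vocab) log-probs of the final output.
        """
        p_final = np.exp(final_logprobs)[:, None, :]
        kl = (p_final * (final_logprobs[:, None, :] - lens_logprobs)).sum(-1)
        return kl.mean(axis=0)  # (n_layers,)

    # The claim survives only if mean_kl_to_final(tuned, final) lies below
    # mean_kl_to_final(logit, final) at most depths on held-out data.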

Original abstract

We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer. To do so, we train an affine probe for each block in a frozen pretrained model, making it possible to decode every hidden state into a distribution over the vocabulary. Our method, the tuned lens, is a refinement of the earlier "logit lens" technique, which yielded useful insights but is often brittle. We test our method on various autoregressive language models with up to 20B parameters, showing it to be more predictive, reliable and unbiased than the logit lens. With causal experiments, we show the tuned lens uses similar features to the model itself. We also find the trajectory of latent predictions can be used to detect malicious inputs with high accuracy. All code needed to reproduce our results can be found at https://github.com/AlignmentResearch/tuned-lens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces the tuned lens: for each layer of a frozen pretrained autoregressive transformer, an affine probe is trained to map the hidden state h_l to a distribution over the vocabulary. This refines the logit lens by learning per-layer readouts. Experiments on models up to 20B parameters show the tuned lens yields more predictive, reliable, and unbiased per-layer predictions than the logit lens. Causal interventions (feature ablations and edits) indicate that the probes rely on features similar to those used by the model itself. The sequence of latent predictions across layers is also shown to detect malicious inputs with high accuracy. Full reproduction code is released.

Significance. If the central claims hold, the tuned lens supplies a practical, scalable tool for inspecting how transformers iteratively refine next-token predictions layer by layer, strengthening mechanistic interpretability research. The multi-scale experiments, causal interventions, and released code are concrete strengths that increase the work's utility. The malicious-input detection result points to a potential safety application, though it is secondary to the interpretability contribution.

major comments (2)
  1. [§4.3] §4.3 (Causal Experiments): The feature interventions and ablations are performed on activations after the affine probes have already been fit to those same activations (or to final logits). This design tests consistency between the learned readout and the intervention effect but does not directly demonstrate that the probes recover the model's native layer-wise computation rather than a convenient linear approximation of downstream layers. A control that compares probe predictions to the model's own un-probed intermediate computations (e.g., via direct logit extraction without regression) would be needed to support the claim that the tuned lens 'uses similar features to the model.'
  2. [§3.2] §3.2 (Definition of 'unbiased'): The claim that the tuned lens is 'unbiased' relative to the logit lens is central to the comparison, yet the precise metric (e.g., whether it refers to calibration error, KL divergence to final logits, or something else) is not formalized before the experiments. Without an explicit definition or derivation showing that the affine map removes a specific bias term, the superiority claim rests on empirical tables whose interpretation depends on this choice.
minor comments (3)
  1. [Table 1, Figure 2] Table 1 and Figure 2: axis labels and legends use inconsistent notation for 'logit lens' vs. 'tuned lens' across panels; standardize to improve readability.
  2. [§4.4] §4.4 (Malicious input detection): the reported accuracy is given without a baseline that uses only final-layer logits or a simple perplexity threshold; adding this control (sketched after these comments) would clarify the incremental value of the trajectory.
  3. [Abstract / §4.1] The abstract states results 'across multiple model sizes up to 20B' but does not list the exact models or token counts used for probe training; this detail belongs in §4.1.
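
For concreteness, the baseline requested in minor comment 2 could look like this (a hypothetical sketch, not anything from the paper): score each input by the perplexity of its tokens under the final-layer distribution alone, and flag inputs above a threshold chosen on benign data.

    import numpy as np

    def final_layer_perplexity(final_logprobs, token_ids):
        """Perplexity of one input from final-layer log-probs only.

        final_logprobs: (n_tokens, vocab) log-probs at each position.
        token_ids: (n_tokens,) the observed next tokens.
        """
        nll = -np.take_along_axis(
            final_logprobs, token_ids[:, None], axis=-1
        ).squeeze(-1)
        return float(np.exp(nll.mean()))

    def flag_malicious(ppl: float, threshold: float) -> bool:
        """Trivial control: flag inputs whose perplexity exceeds a threshold."""
        return ppl > threshold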

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating where we agree and will revise the manuscript for clarity and rigor.

Point-by-point responses
  1. Referee: [§4.3] §4.3 (Causal Experiments): The feature interventions and ablations are performed on activations after the affine probes have already been fit to those same activations (or to final logits). This design tests consistency between the learned readout and the intervention effect but does not directly demonstrate that the probes recover the model's native layer-wise computation rather than a convenient linear approximation of downstream layers. A control that compares probe predictions to the model's own un-probed intermediate computations (e.g., via direct logit extraction without regression) would be needed to support the claim that the tuned lens 'uses similar features to the model.'

    Authors: We agree that the interventions are applied after probe fitting, so they primarily confirm consistency between the readout and the causal effects on activations. Because the tuned lens is trained to recover the model's final output distribution from each hidden state, the fact that interventions on the same features affect both the probe and the model's downstream computation in aligned ways provides evidence that the probes rely on features the model itself uses to refine predictions. The logit lens (direct logit extraction without learned regression) is already used as the suggested control throughout our comparisons. In the revision we will explicitly reframe §4.3 to present the logit-lens results as this control and discuss how the tuned lens improves upon it under interventions. revision: partial

  2. Referee: [§3.2] §3.2 (Definition of 'unbiased'): The claim that the tuned lens is 'unbiased' relative to the logit lens is central to the comparison, yet the precise metric (e.g., whether it refers to calibration error, KL divergence to final logits, or something else) is not formalized before the experiments. Without an explicit definition or derivation showing that the affine map removes a specific bias term, the superiority claim rests on empirical tables whose interpretation depends on this choice.

    Authors: We acknowledge that 'unbiased' was introduced without a formal definition in §3.2. In the revised manuscript we will add an explicit definition: the tuned lens is unbiased relative to the logit lens when its per-layer predictions exhibit lower average KL divergence to the final-layer output distribution (measured across models and datasets). We will also briefly explain that the learned affine map corrects for the distributional shift between intermediate hidden states and the final unembedding, while noting that this is an empirical correction rather than a full theoretical derivation of bias removal. revision: yes
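
The promised definition can be written down in one line (our rendering of the rebuttal's wording, not a formula taken from the paper; $p_L$ is the final-layer distribution and $q_\ell$ a lens's layer-$\ell$ prediction). The tuned lens is less biased than the logit lens when

    \mathbb{E}_x\left[ D_{\mathrm{KL}}\big( p_L(\cdot \mid x) \,\|\, q_\ell^{\mathrm{tuned}}(\cdot \mid x) \big) \right]
    \;<\;
    \mathbb{E}_x\left[ D_{\mathrm{KL}}\big( p_L(\cdot \mid x) \,\|\, q_\ell^{\mathrm{logit}}(\cdot \mid x) \big) \right]
    \qquad \text{for every layer } \ell,

with the expectation over held-out tokens, averaged across models and datasets as the rebuttal states. Note that the direction of the divergence is itself part of the definitional choice the referee flags.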

Circularity Check

0 steps flagged

No significant circularity in claimed derivation chain

Full rationale

The paper trains affine probes on held-out data to map intermediate hidden states to vocabulary distributions and evaluates the resulting tuned lens on separate malicious-input detection benchmarks plus causal interventions. These steps do not reduce the core claims (greater predictiveness than logit lens, similar features to the model, detection utility) to fitted quantities by construction; the validation data and experiments are independent of the probe-fitting objective.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the assumption that an affine map per layer is sufficient to decode hidden states and that the model's own next-token predictions are the correct supervision signal. No new physical entities or unstated mathematical axioms are introduced.

free parameters (1)
  • per-layer affine probe weights and biases
    Trained on the model's own outputs for each block; these are the central fitted objects.
axioms (1)
  • domain assumption: Hidden states at each layer contain information that can be affinely mapped to the final vocabulary distribution
    Invoked when the authors train the probes and interpret their outputs as latent predictions.

pith-pipeline@v0.9.0 · 5470 in / 1180 out tokens · 36860 ms · 2026-05-12T16:49:22.662071+00:00 · methodology


Forward citations

Cited by 38 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

    cs.LG 2026-05 unverdicted novelty 8.0

    Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% str...

  2. Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

    cs.LG 2026-04 accept novelty 8.0

    Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

  3. From Noise to Diversity: Random Embedding Injection in LLM Reasoning

    cs.AI 2026-05 conditional novelty 7.0

    Random Soft Prompts (RSPs) sampled from the embedding distribution improve Pass@N on reasoning benchmarks by increasing early-stage token diversity without any training.

  4. Deep Minds and Shallow Probes

    cs.LG 2026-05 unverdicted novelty 7.0

    Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

  5. In-Context Fixation: When Demonstrated Labels Override Semantics in Few-Shot Classification

    cs.LG 2026-05 conditional novelty 7.0

    In-context learning binds model outputs to the demonstrated label tokens as an exhaustive vocabulary, overriding semantic plausibility and causing fixation even with homogeneous or nonsense labels.

  6. The Convergence Gap: Instruction-Tuned Language Models Stabilize Later in the Forward Pass

    cs.LG 2026-05 unverdicted novelty 7.0

    Instruction-tuned language models stabilize their next-token predictions later in the forward pass than pretrained models, with late MLP layers providing the strongest tested control point under matched histories.

  7. Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions

    cs.CL 2026-05 unverdicted novelty 7.0

    Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.

  8. Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Tabular foundation models show substantial depthwise redundancy, so a looped single-layer version achieves comparable results with 20% of the original parameters.

  9. Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

    cs.CL 2026-05 unverdicted novelty 7.0

    Transformer circuits show free evolution during SFT, rendering static mechanistic localization inadequate for future parameter updates due to inherent temporal latency.

  10. The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It

    cs.LG 2026-05 unverdicted novelty 7.0

    Transformers encode counts correctly internally but fail to read them out due to misalignment with digit output directions, fixable by updating 37k output parameters or small LoRA on attention.

  11. PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training

    cs.LG 2026-04 unverdicted novelty 7.0

    Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.

  12. N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation

    cs.LG 2026-05 unverdicted novelty 6.0

    N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model depths.

  13. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

  14. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  15. Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...

  16. Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Instruction token embeddings encode visual information that can be leveraged to detect object hallucinations in MLLMs via a new combined score outperforming prior detectors.

  17. Not How Many, But Which: Parameter Placement in Low-Rank Adaptation

    cs.LG 2026-05 unverdicted novelty 6.0

    Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.

  18. Instructions Shape Production of Language, not Processing

    cs.CL 2026-05 unverdicted novelty 6.0

    Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.

  19. The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.

  20. A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases

    cs.LG 2026-05 unverdicted novelty 6.0

    LLMs exhibit three geometric phases in next-token prediction—seeding multiplexing, hoisting overriding, and focal convergence—where predictive subspaces rise, stabilize, and converge across layers.

  21. Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

    cs.AI 2026-05 unverdicted novelty 6.0

    Trajectory geometry in embedding space fused with coverage and verbalization yields better black-box CoT confidence estimation than self-consistency at lower sample counts across six benchmark-reasoner pairs.

  22. Large Vision-Language Models Get Lost in Attention

    cs.AI 2026-05 unverdicted novelty 6.0

    In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.

  23. Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits

    cs.AI 2026-05 unverdicted novelty 6.0

    Attention sharpness barely predicts VLM correctness while hidden-state probes and self-consistency strongly do, with late-fusion models showing fragile reliability bottlenecks unlike early-fusion ones.

  24. Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.

  25. Escaping Mode Collapse in LLM Generation via Geometric Regulation

    cs.CL 2026-05 unverdicted novelty 6.0

    Reinforced Mode Regulation (RMR) uses low-rank damping on the value cache to prevent geometric collapse and mode collapse in autoregressive LLM generation, supporting stable output down to 0.8 nats/step entropy.

  26. Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs favor task-appropriate reasoning over conflicting instructions, yet reasoning types are linearly encoded in middle-to-late layers and can be steered to boost instruction compliance by up to 29%.

  27. LLM Safety From Within: Detecting Harmful Content with Internal Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

  28. Predicting Where Steering Vectors Succeed

    cs.LG 2026-04 unverdicted novelty 6.0

    The Linear Accessibility Profile predicts steering vector effectiveness and optimal layers with Spearman correlations of 0.86-0.91 using unembedding projections on intermediate states across multiple models and concepts.

  29. Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task

    cs.LG 2026-04 unverdicted novelty 6.0

    Transformers show limited adaptive depth use on relational reasoning, with clearer evidence after finetuning on the task.

  30. From Attribution to Action: A Human-Centered Application of Activation Steering

    cs.AI 2026-04 unverdicted novelty 6.0

    Activation steering paired with attribution enables intervention-based debugging in vision models, as all 8 interviewed experts shifted to hypothesis testing, most trusted observed responses, and highlighted risks lik...

  31. Darkness Visible: Reading the Exception Handler of a Language Model

    cs.LG 2026-04 conditional novelty 6.0

    GPT-2 Small's terminal MLP implements a legible three-tier exception handler with 27 named neurons that routes predictions, while previously identified knowledge neurons function as amplifiers of residual-stream signa...

  32. Automated Attention Pattern Discovery at Scale in Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    AP-MAE reconstructs masked attention patterns in LLMs with high accuracy, generalizes across models, predicts generation correctness at 55-70%, and enables 13.6% accuracy gains via targeted interventions.

  33. Instructions Shape Production of Language, not Processing

    cs.CL 2026-05 unverdicted novelty 5.0

    Instructions primarily shape the production stage of language models rather than the processing stage, with task-specific information and causal effects stronger in output tokens than input tokens.

  34. Towards Effective Theory of LLMs: A Representation Learning Approach

    cs.LG 2026-05 unverdicted novelty 5.0

    RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.

  35. HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory

    cs.AI 2026-05 unverdicted novelty 5.0

    HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.

  36. Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

    cs.CV 2026-05 unverdicted novelty 5.0

    PVM adds a parallel learnable branch to LVLMs that supplies visual embeddings on demand to structurally prevent attention decay and visual signal dilution during deep autoregressive generation.

  37. Probing for Reading Times

    cs.CL 2026-04 unverdicted novelty 5.0

    Early layers of language models predict early-pass human reading times better than surprisal, with surprisal superior for late-pass measures and strong variation by language.

  38. Distributed Interpretability and Control for Large Language Models

    cs.LG 2026-04 conditional novelty 4.0

    A distributed system for logit lens and steering vectors on multi-GPU LLMs achieves up to 7x lower activation memory and 41x higher throughput while producing monotonic output shifts with mean slope 0.702.
