Eliciting Latent Predictions from Transformers with the Tuned Lens
Mixed citation behavior. Most common role is background (47%).
abstract
We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer. To do so, we train an affine probe for each block in a frozen pretrained model, making it possible to decode every hidden state into a distribution over the vocabulary. Our method, the tuned lens, is a refinement of the earlier "logit lens" technique, which yielded useful insights but is often brittle. We test our method on various autoregressive language models with up to 20B parameters, showing it to be more predictive, reliable and unbiased than the logit lens. With causal experiments, we show the tuned lens uses similar features to the model itself. We also find the trajectory of latent predictions can be used to detect malicious inputs with high accuracy. All code needed to reproduce our results can be found at https://github.com/AlignmentResearch/tuned-lens.
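As a rough illustration of the idea in the abstract, the sketch below contrasts a logit-lens style readout with a tuned-lens style readout that applies a per-layer affine map before unembedding. It is a minimal NumPy sketch under stated assumptions, not the code from the linked repository: the names `logit_lens`, `tuned_lens`, `W_U`, `A`, and `b` are hypothetical, the weights are random, and normalization layers and the probe-training step are omitted.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logit_lens(h, W_U):
    # Decode a hidden state directly with the unembedding matrix.
    return softmax(h @ W_U)

def tuned_lens(h, A, b, W_U):
    # Apply a per-layer affine map (A, b) to the hidden state, then unembed.
    # In practice (A, b) would be learned for each block of a frozen model.
    return softmax((h @ A + b) @ W_U)

rng = np.random.default_rng(0)
d_model, vocab = 8, 20                    # toy sizes, chosen for illustration
W_U = rng.normal(size=(d_model, vocab))   # stand-in unembedding matrix
h = rng.normal(size=(d_model,))           # stand-in intermediate hidden state

# With an identity map and zero bias, the affine probe changes nothing,
# so the tuned-lens readout coincides with the logit-lens readout.
A = np.eye(d_model)
b = np.zeros(d_model)

p_logit = logit_lens(h, W_U)
p_tuned = tuned_lens(h, A, b, W_U)
```

Both functions return a probability distribution over the toy vocabulary. The identity initialization here only shows that the logit lens is the special case of an untrained affine probe; a trained probe would generally give a different distribution.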
citing papers explorer
-
Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.
-
Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens
Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
-
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
-
PRISM: Recovering Instruction Sets from Language Model Activations
PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.
-
Toward Calibrated, Fair, and accurate Deepfake Detection
Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
-
Training-Free Looped Transformers
Training-free looped transformers retrofit recurrence to frozen models via damped ODE sub-steps on mid-stack blocks, yielding gains such as +2.64 pp on MMLU-Pro for Qwen3-4B.
-
The physics of AI weather models
AI weather models may simulate the atmosphere via particle positions in latent space whose updates follow gradient flow on a learned free energy functional rather than conventional physical equations.
-
Scale-Dependent Collective Adaptation in Self-Amending LLM Societies: A Cross-Family Study of Emergent Governance
LLM societies in Nomic show non-monotonic collective adaptation peaking at mid-scales, with smaller models rule-inert and larger ones restrictive.
-
Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space
Pretraining and alignment induce asymmetric geometric traces in transformer weights because alignment updates concentrate in read pathways due to activation covariance while write pathways inherit less structure from alignment losses.
-
From Noise to Diversity: Random Embedding Injection in LLM Reasoning
Random Soft Prompts (RSPs) sampled from the embedding distribution improve Pass@N on reasoning benchmarks by increasing early-stage token diversity without any training.
-
Deep Minds and Shallow Probes
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
-
In-Context Fixation: When Demonstrated Labels Override Semantics in Few-Shot Classification
In-context learning binds model outputs to the demonstrated label tokens as an exhaustive vocabulary, overriding semantic plausibility and causing fixation even with homogeneous or nonsense labels.
-
The Convergence Gap: Instruction-Tuned Language Models Stabilize Later in the Forward Pass
Instruction-tuned language models stabilize their next-token predictions later in the forward pass than pretrained models, with late MLP layers providing the strongest tested control point under matched histories.
-
Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions
Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.
-
Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models
Tabular foundation models show substantial depthwise redundancy, so a looped single-layer version achieves comparable results with 20% of the original parameters.
-
Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training
Transformer circuits show free evolution during SFT, rendering static mechanistic localization inadequate for future parameter updates due to inherent temporal latency.
-
The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It
Transformers store count information internally but cannot read it out as digits due to near-orthogonal alignment with output-head rows; updating digit rows or applying LoRA to attention layers improves constrained and unconstrained counting respectively.
-
PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training
Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
-
Understanding How MLLMs Describe Artworks Using Token Activation Maps
Token Activation Maps applied to MLLM art descriptions reveal that visual grounding strength varies by token category, with better artist identification than title prediction.
-
BrainSurgery: Reproducible and Reliable Declarative Weight Manipulations for Model Editing and Upcycling
Introduces BrainSurgery, a declarative YAML-based tool for reproducible tensor surgery on deep learning checkpoints with built-in validation assertions, shown via examples and case studies in model upcycling and LoRA extraction.
-
Child-directed speech facilitates production, not comprehension, in BabyLMs
CDS-trained BabyLMs show earlier and more appropriate production in a new frame-completion task while FineWeb-edu models lead on comprehension benchmarks, indicating current tests underestimate CDS benefits.
-
Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations
Deception probes in LLMs collapse under stylistic shifts but recover with style-augmented training, rejecting single-direction and entropy hypotheses in favor of distributed multi-dimensional signals.
-
Is Inference Mediated by Distinct Semantic Structures in LLMs? A Mechanistic Interpretation
LLMs represent semantic operations in NLI via decodable, partially distinct activation subspaces whose steering causally affects predictions, though with model-dependent steerability and cross-operation interference.
-
Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography
Sparse autoencoders applied to GPT-2 and Llama models recover semantic features accounting for 94% of peak brain encoding performance and map onto distinct cortical semantic regions across three languages.
-
How Many Different Outputs Can a Transformer Generate?
Transformers are limited to a linearly growing number of accessible output sequences with prompt length, with exponential decay in accessible proportion beyond a critical point, even under unbounded context.
-
Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)
SVD on the lm_head weight matrix of transformers reveals interpretable vocabulary clusters that indicate training data composition, model differences, and ethical concerns in models like GPT-OSS, Gemma, and Qwen.
-
Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry
Geometry-Lite decomposes LLM safety detection into layer-wise margin geometries and finds that persistent boundary positions, not layer-to-layer drift, drive most detection performance across nine models and seven benchmarks.
-
Scale Determines Whether Language Models Organize Representation Geometry for Prediction
Representation geometry in language models aligns with the unembedding readout subspace in a scale-dependent manner, preserved throughout training in large models but progressively lost in late layers of small models despite continued loss improvement.
-
Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning
Distinguishable Deletion unifies knowledge erasure and refusal for LLM unlearning via an energy index that enforces boundaries during training and enables refusal at inference.
-
Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology
Training installs a depth-dependent spectral gradient and low-rank bottleneck in LLM residual streams whose amplification or suppression of graph communities is predicted by local operator type.
-
N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model depths.
-
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 points while prompt defenses fail on variants.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
-
Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models
Instruction token embeddings encode visual information that can be leveraged to detect object hallucinations in MLLMs via a new combined score outperforming prior detectors.
-
Not How Many, But Which: Parameter Placement in Low-Rank Adaptation
Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.
-
Instructions Shape Production of Language, not Processing
Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
-
The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations
Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.
-
A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases
LLMs exhibit three geometric phases in next-token prediction (seeding multiplexing, hoisting overriding, and focal convergence), in which predictive subspaces rise, stabilize, and converge across layers.
-
Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization
Trajectory geometry in embedding space fused with coverage and verbalization yields better black-box CoT confidence estimation than self-consistency at lower sample counts across six benchmark-reasoner pairs.
-
Large Vision-Language Models Get Lost in Attention
In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.
-
Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits
Attention sharpness barely predicts VLM correctness while hidden-state probes and self-consistency strongly do, with late-fusion models showing fragile reliability bottlenecks unlike early-fusion ones.
-
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.
-
LLM Safety From Within: Detecting Harmful Content with Internal Representations
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
-
Predicting Where Steering Vectors Succeed
The Linear Accessibility Profile predicts steering vector effectiveness and optimal layers with Spearman correlations of 0.86-0.91 using unembedding projections on intermediate states across multiple models and concepts.
-
Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task
Transformers show limited adaptive depth use on relational reasoning, with clearer evidence after finetuning on the task.
-
Darkness Visible: Reading the Exception Handler of a Language Model
GPT-2 Small's terminal MLP implements a legible three-tier exception handler with 27 named neurons that routes predictions, while previously identified knowledge neurons function as amplifiers of residual-stream signals rather than fact storage.
-
Automated Attention Pattern Discovery at Scale in Large Language Models
AP-MAE reconstructs masked attention patterns in LLMs with high accuracy, generalizes across models, predicts generation correctness at 55-70%, and enables 13.6% accuracy gains via targeted interventions.
-
From Words to Amino Acids: Does the Curse of Depth Persist?
Protein language models exhibit consistent depth inefficiency where most task-relevant computation occurs in a subset of layers, mirroring patterns in large language models.
-
Graph-Regularized Sparse Autoencoders for LLM Safety Steering
GSAE improves selective refusal on safety benchmarks by smoothing SAE directions over a co-activation graph and applying them via a two-gate controller, outperforming standard SAEs and baselines on Llama-3 and other models.