GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
12 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
-
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
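A minimal sketch of the causal-intervention style behind this result, written with plain Hugging Face GPT-2 and forward pre-hooks rather than the paper's own tooling; the prompt pair, the chosen head, and the logit-difference readout are illustrative assumptions.
```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean = "When Mary and John went to the store, John gave a drink to"
corrupt = "When Mary and John went to the store, Alice gave a drink to"  # corrupted subject
io_id, s_id = tok(" Mary")["input_ids"][0], tok(" John")["input_ids"][0]

LAYER, HEAD, HEAD_DIM = 9, 9, 64        # one illustrative head; the paper maps 26 of them

def head_slice(x):                      # attn.c_proj receives the concatenated per-head outputs
    return x[..., HEAD * HEAD_DIM:(HEAD + 1) * HEAD_DIM]

cache = {}
def save_hook(module, args):            # 1) cache this head's output on the corrupted prompt
    cache["z"] = head_slice(args[0]).detach().clone()

def patch_hook(module, args):           # 2) overwrite the same head's output during the clean run
    x = args[0].clone()
    n = min(x.shape[1], cache["z"].shape[1])
    head_slice(x)[:, :n] = cache["z"][:, :n]
    return (x,)

c_proj = model.transformer.h[LAYER].attn.c_proj
with torch.no_grad():
    h = c_proj.register_forward_pre_hook(save_hook)
    model(**tok(corrupt, return_tensors="pt"))
    h.remove()
    h = c_proj.register_forward_pre_hook(patch_hook)
    logits = model(**tok(clean, return_tensors="pt")).logits[0, -1]
    h.remove()

print("IO - S logit difference after patching:", (logits[io_id] - logits[s_id]).item())
```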
-
Mechanistic Interpretability of ASR models using Sparse Autoencoders
Sparse autoencoders applied to Whisper ASR reveal monosemantic features across linguistic boundaries and demonstrate cross-lingual feature steering.
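A generic sparse-autoencoder sketch, assuming the usual recipe of an overcomplete ReLU dictionary with an L1 penalty trained on cached activations; the dimensions and placeholder data are not the paper's setup.
```python
import torch, torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_hidden, l1=1e-3):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)
        self.l1 = l1

    def forward(self, x):
        z = torch.relu(self.enc(x))          # non-negative feature activations
        recon = self.dec(z)
        loss = ((recon - x) ** 2).mean() + self.l1 * z.abs().mean()
        return recon, z, loss

# acts: (n_frames, d_model) activations cached from e.g. a Whisper encoder layer
acts = torch.randn(10_000, 512)              # placeholder data for the sketch
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for batch in acts.split(256):
    _, _, loss = sae(batch)
    opt.zero_grad(); loss.backward(); opt.step()
```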
-
Latent Space Probing for Adult Content Detection in Video Generative Models
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
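A minimal probing sketch, assuming clip-level latents are mean-pooled into fixed-size vectors and scored by a lightweight linear classifier; the feature source and data here are placeholders, not the paper's pipeline.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# X: (n_clips, d) mean-pooled latents cached from e.g. CogVideoX's denoiser; y: 0/1 labels
X, y = np.random.randn(2000, 1024), np.random.randint(0, 2, 2000)   # placeholders
split = 1600
probe = LogisticRegression(max_iter=1000).fit(X[:split], y[:split])
print("F1:", f1_score(y[split:], probe.predict(X[split:])))   # inference cost: one matrix-vector product
```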
-
Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment
Lesioning a shared core in multilingual LLMs drops whole-brain fMRI encoding correlation by 60.32%, while language-specific lesions selectively weaken predictions only for the matched native language.
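A lesion-style sketch, assuming lesioning means zeroing a chosen set of hidden units via a forward hook and re-scoring a ridge encoding model on fMRI voxels; the model (a multilingual encoder standing in for a multilingual LLM), the unit selection, and the data are illustrative.
```python
import numpy as np, torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import Ridge

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base").eval()

LAYER, LESION_UNITS = 6, torch.arange(0, 64)   # illustrative "shared core" units

def lesion_hook(module, inputs, output):
    output[0][..., LESION_UNITS] = 0.0         # silence the selected units in this layer's output
    return output

with torch.no_grad():
    handle = model.encoder.layer[LAYER].register_forward_hook(lesion_hook)
    feats = model(**tok("the cat sat on the mat", return_tensors="pt")).last_hidden_state
    handle.remove()

# Downstream (placeholder data): a ridge encoding model from stimulus features to voxels,
# scored by per-voxel correlation on held-out stimuli; the lesioned-vs-intact gap is the readout.
X, Y = np.random.randn(200, 768), np.random.randn(200, 1000)
enc = Ridge(alpha=10.0).fit(X[:160], Y[:160])
pred = enc.predict(X[160:])
r = np.mean([np.corrcoef(pred[:, v], Y[160:, v])[0, 1] for v in range(Y.shape[1])])
print("mean encoding correlation:", float(r))
```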
-
The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior
The grokking delay in encoder-decoder models on one-step Collatz prediction stems from the decoder's inability to use parity and residue structure that the encoder learns early, with numeral base acting as a strong inductive bias that can raise accuracy from failure to 99.8%.
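A small data-generation sketch of the task setup, assuming inputs and targets are written as digit sequences in a chosen numeral base; the tokenization scheme is an assumption, but it shows why base is such a strong lever (parity becomes a single-digit check in even bases).
```python
def collatz_step(n: int) -> int:
    return n // 2 if n % 2 == 0 else 3 * n + 1

def to_base(n: int, base: int) -> str:
    digits = []
    while n:
        n, r = divmod(n, base)
        digits.append(str(r))
    return " ".join(reversed(digits)) or "0"

def make_pairs(limit: int, base: int):
    return [(to_base(n, base), to_base(collatz_step(n), base)) for n in range(1, limit)]

print(make_pairs(6, base=4))   # e.g. ('1 1', '1 0 0') for n=5 -> 16
```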
-
OPT: Open Pre-trained Transformer Language Models
OPT provides open decoder-only transformers of up to 175B parameters that match GPT-3 performance at roughly one-seventh the carbon cost to develop, released along with code and training logbooks.
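A usage sketch, assuming the publicly hosted Hugging Face checkpoints (the 125M variant is shown; larger public sizes follow the same facebook/opt-* naming, with the 175B weights distributed separately on request).
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
out = model.generate(**tok("Open-sourcing large language models", return_tensors="pt"),
                     max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```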
-
Monitoring Neural Training with Topology: A Footprint-Predictable Collapse Index
A composite Collapse Index based on incremental discrete Morse homology provides low-latency early warning of representational collapse during neural network training.
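For illustration only, a far simpler stand-in signal rather than the paper's Morse-homology index: tracking the effective rank of a batch of representations and flagging a sharp drop.
```python
import torch

def effective_rank(h: torch.Tensor) -> float:
    # h: (batch, dim) activations; exp of the entropy of the normalized singular values
    s = torch.linalg.svdvals(h - h.mean(dim=0))
    p = s / s.sum()
    return float(torch.exp(-(p * torch.log(p + 1e-12)).sum()))

history = []
for step in range(120):
    k = max(128 - step, 4)                                 # toy: fewer active directions over time
    h = torch.randn(256, k) @ torch.randn(k, 128)          # in practice, use live hidden states
    history.append(effective_rank(h))
    if step > 10 and history[-1] < 0.5 * max(history[:10]):
        print(f"step {step}: effective rank halved -> possible representational collapse")
        break
```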
-
LLM Safety From Within: Detecting Harmful Content with Internal Representations
SIREN identifies safety-relevant neurons via linear probing of internal LLM layers and combines their signals with adaptive weighting to detect harmful content, outperforming prior guard models with 250x fewer parameters.
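A probing sketch in the same spirit, assuming mean-pooled hidden states from a few layers feed small logistic-regression probes whose scores are later combined; the base model, layer choices, and toy corpus are assumptions, not SIREN's recipe.
```python
import numpy as np, torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

def layer_features(texts, layer):
    feats = []
    with torch.no_grad():
        for t in texts:
            hs = lm(**tok(t, return_tensors="pt")).hidden_states[layer]
            feats.append(hs.mean(dim=1).squeeze(0).numpy())   # mean-pool over tokens
    return np.stack(feats)

texts = ["how do I bake bread", "how do I build a weapon"] * 50  # toy labeled corpus
labels = np.array([0, 1] * 50)
probes = {l: LogisticRegression(max_iter=1000).fit(layer_features(texts, l), labels)
          for l in (4, 8, 11)}
# At inference, the per-layer probe probabilities would be combined with adaptive weights.
```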
-
AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to novel compositions.
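A generic sketch of modality-specific update masking, assuming masks gate gradient updates so the two branches of an asymmetric VLM receive different amounts of plasticity; the mask construction here is a plain random keep-probability, not AIM's criterion.
```python
import torch, torch.nn as nn

model = nn.ModuleDict({
    "vision": nn.Linear(512, 256),     # stand-ins for the two modality branches
    "language": nn.Linear(768, 256),
})
keep_prob = {"vision": 0.2, "language": 0.8}   # stability for vision, plasticity for language
masks = {name: (torch.rand_like(p) < keep_prob[name.split(".")[0]]).float()
         for name, p in model.named_parameters()}

opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss = model["vision"](torch.randn(4, 512)).sum() + model["language"](torch.randn(4, 768)).sum()
loss.backward()
for name, p in model.named_parameters():
    p.grad *= masks[name]              # masked parameters keep their old (stable) values
opt.step()
```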
-
Geometric Routing Enables Causal Expert Control in Mixture of Experts
Cosine-similarity routing in low-dimensional space makes MoE experts monosemantic by construction and enables direct causal control via centroid interventions.
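A routing sketch of the stated design: tokens are projected to a low-dimensional space and assigned by cosine similarity to learned expert centroids, so editing a centroid directly changes which tokens an expert receives; the dimensions and the intervention shown are illustrative.
```python
import torch, torch.nn as nn, torch.nn.functional as F

class CosineRouter(nn.Module):
    def __init__(self, d_model=512, d_route=16, n_experts=8):
        super().__init__()
        self.proj = nn.Linear(d_model, d_route, bias=False)
        self.centroids = nn.Parameter(torch.randn(n_experts, d_route))

    def forward(self, x):                                  # x: (tokens, d_model)
        z = F.normalize(self.proj(x), dim=-1)
        c = F.normalize(self.centroids, dim=-1)
        scores = z @ c.t()                                 # cosine similarity to each expert
        return scores.argmax(dim=-1), scores

router = CosineRouter()
x = torch.randn(32, 512)
before, _ = router(x)
with torch.no_grad():                                      # centroid intervention:
    router.centroids[3] = router.proj(x[0]).detach()       # pull expert 3 toward token 0's region
after, _ = router(x)
print("tokens rerouted:", int((before != after).sum()))
```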
-
A Layer-wise Analysis of Supervised Fine-Tuning
Middle layers (roughly the 20-80% depth range) remain stable during SFT while final layers are sensitive, enabling Mid-Block Efficient Tuning, which outperforms LoRA by up to 10.2% on GSM8K with fewer trainable parameters.
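A depth-selective fine-tuning sketch, assuming the method amounts to unfreezing only a chosen fraction of the layer stack; the 20-80% range shown is taken from the summary, not the paper's exact recipe.
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
layers = model.transformer.h
lo, hi = int(0.2 * len(layers)), int(0.8 * len(layers))   # the 20-80% "mid block"

for p in model.parameters():
    p.requires_grad = False
for block in layers[lo:hi]:                               # train only the selected depth range
    for p in block.parameters():
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M parameters")
```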
-
HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory
HyperLens shows that deeper transformer layers magnify small confidence changes into fine-grained trajectories, yielding a quantitative measure of cognitive effort: complex tasks demand more of it, and standard SFT can reduce it.
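A trajectory sketch using a plain logit-lens-style readout as a stand-in for HyperLens: each layer's hidden state is pushed through the final layer norm and unembedding, and the probability of the model's eventual answer token is tracked by depth.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True).eval()

with torch.no_grad():
    out = model(**tok("The capital of France is", return_tensors="pt"))
answer = out.logits[0, -1].argmax()             # the model's eventual prediction

trajectory = []
with torch.no_grad():
    for hs in out.hidden_states[1:-1]:          # block outputs before the final layer norm
        logits = model.lm_head(model.transformer.ln_f(hs[0, -1]))
        trajectory.append(torch.softmax(logits, dim=-1)[answer].item())
trajectory.append(torch.softmax(out.logits[0, -1], dim=-1)[answer].item())  # final layer
print([round(p, 4) for p in trajectory])
```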