Mixed citations
org/abs/2305.01610
Mixed citation behavior. Most common role is background (60%).
citing papers explorer
- WriteSAE: Sparse Autoencoders for Recurrent State
  WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
- Do Audio-Visual Large Language Models Really See and Hear?
  AVLLMs encode audio semantics in middle layers but suppress them in final text outputs when audio conflicts with vision, due to training that largely inherits from vision-language base models.
- Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
  Neurons exhibit concept-conditioned activation ranges forming Gaussian-like distributions with minimal overlap, and range-based interventions via NeuronLens outperform neuron-level masking in targeted manipulation with reduced collateral effects.
- Scaling and evaluating sparse autoencoders
  K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
- Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection
  Activation steering produces synthetic safety-violating data that improves downstream classifiers over prompting on most tested concepts when a harmonic mean of alignment, coherence, and diversity is optimized.
- Mechanistic Interpretability for Learning Assurance of a Vision-Based Landing System
  A vision transformer for runway keypoint regression is decomposed via K-SVD into content and style atoms; the model relies primarily on content atoms, enabling out-of-model-scope detection for runtime assurance in aviation.
- Chessformer: A Unified Architecture for Chess Modeling
  Chessformer is a unified encoder-only transformer for chess that uses square tokens, geometric attention bias, and an attention-based policy head to set new records in human move prediction accuracy, playing strength, and interpretability.
- A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models
  Suppressing one refusal neuron or amplifying one concept neuron bypasses safety alignment in LLMs from 1.7B to 70B parameters without training or prompt engineering.
- Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
  Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark performance.
- Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
  Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while preserving safety.
- Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs
  LIRA aligns latent instruction representations in LLMs to defend against jailbreaks, backdoors, and undesired knowledge, blocking over 99% of PEZ attacks and achieving optimal WMDP forgetting.
- I Walk the Line: Examining the Role of Gestalt Continuity in Object Binding for Vision Transformers
  Pretrained vision transformers use specific attention heads sensitive to Gestalt continuity for object binding, shown via probes on synthetic datasets and ablation experiments.
- Darkness Visible: Reading the Exception Handler of a Language Model
  GPT-2 Small's terminal MLP implements a legible three-tier exception handler with 27 named neurons that routes predictions, while previously identified knowledge neurons function as amplifiers of residual-stream signals rather than fact storage.
- Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models
  Language models contain localized entity-selective neurons in early layers that causally mediate factual recall for specific entities across surface variations.
- Foundation Models for Discovery and Exploration in Chemical Space
  MIST models up to 10x larger than prior work, fine-tuned on over 400 structure-property tasks, match or exceed SOTA on benchmarks and demonstrate zero-shot olfactory perception mapping consistent with hyperbolic geometry.
- Beyond Cross-Modal Alignment: Measuring and Leveraging Modality Gap in Vision-Language Models
  Introduces Modality Dominance Score (MDS) to measure modality-specific features in VLMs and applies training-free editing to improve bias mitigation, adversarial generation, and modality control.
- Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework
  Safe-SAIL supplies a pre-explanation metric and segment-level simulation to interpret 1758 safety SAE features across pornography, politics, violence, and terror, with public models and tools released.
- Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
  Varying evaluation metrics and corruption methods in activation patching produces different localization and circuit discovery outcomes in language models, leading to recommendations for preferred practices.
- There Will Be a Scientific Theory of Deep Learning
  A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.