Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.
hub
What Does BERT Look At? An Analysis of BERT's Attention
22 Pith papers cite this work. Polarity classification is still indexing.
abstract
Large pre-trained neural networks such as BERT have had great recent success in NLP, motivating a growing body of research investigating what aspects of language they are able to learn from unlabeled data. Most recent analysis has focused on model outputs (e.g., language model surprisal) or internal vector representations (e.g., probing classifiers). Complementary to these works, we propose methods for analyzing the attention mechanisms of pre-trained models and apply them to BERT. BERT's attention heads exhibit patterns such as attending to delimiter tokens, specific positional offsets, or broadly attending over the whole sentence, with heads in the same layer often exhibiting similar behaviors. We further show that certain attention heads correspond well to linguistic notions of syntax and coreference. For example, we find heads that attend to the direct objects of verbs, determiners of nouns, objects of prepositions, and coreferent mentions with remarkably high accuracy. Lastly, we propose an attention-based probing classifier and use it to further demonstrate that substantial syntactic information is captured in BERT's attention.
hub tools
citation-role summary
citation-polarity summary
roles
background 4polarities
background 4representative citing papers
Language-model-guided program synthesis can approximate transformer attention heads with over 75% IoU fidelity on held-out data and allow replacing 25% of heads with only 16% average perplexity increase.
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
LLMs achieve strong results on syntax parsing tasks but show limited and variable performance on dynamic reasoning, with a clear performance hierarchy across model scales.
Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled from speaker info while speaker info resists compact containment.
HeadRank lifts preference optimization into attention space via entropy-regularized head selection and distribution regularizers to sharpen discriminability for efficient listwise reranking.
Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning in transformers.
Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.
Attribution graphs reveal that RAG failures arise from shallow fragmented evidence flow in LLMs, enabling topology-based detection and targeted interventions that reinforce question-guided routing.
Protein language models exhibit consistent depth inefficiency where most task-relevant computation occurs in a subset of layers, mirroring patterns in large language models.
Speculative Verification adds a companion model that estimates draft-target alignment via information gain to dynamically set verification length, delivering up to 2x speedup over standard speculative decoding across tested models and batch sizes.
CD-MoE condenses fine-grained MoE layers with shared experts into dense layers, retaining 90% accuracy with 27.5% memory cut and 1.26x speedup on DeepSeekMoE-16B, recovering 98% via brief fine-tuning.
DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.
Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training convergence.
Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.
K-VEC is a coverage-aware KV-cache eviction strategy using cross-head and cross-layer modules that improves performance by up to 10.35 points over prior methods on LongBench subsets at fixed memory budget.
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
LLM surprisal and attention entropy replicate syncretism modulation of agreement attraction in English and German, align with null results in Turkish, and partially match Russian patterns.
Opcode-sequence generative models produce synthetic malware data that raises minor-class classification accuracy by up to 60% and overall detection to 96%.
citing papers explorer
-
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.
-
Explaining Attention with Program Synthesis
Language-model-guided program synthesis can approximate transformer attention heads with over 75% IoU fidelity on held-out data and allow replacing 25% of heads with only 16% average perplexity increase.
-
Massive Activations in Large Language Models
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
-
Exploring Code Analysis: Zero-Shot Insights on Syntax and Semantics with LLMs
LLMs achieve strong results on syntax parsing tasks but show limited and variable performance on dynamic reasoning, with a clear performance hierarchy across model scales.
-
From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models
Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
-
A framework for analyzing concept representations in neural models
A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled from speaker info while speaker info resists compact containment.
-
HeadRank: Decoding-Free Passage Reranking via Preference-Aligned Attention Heads
HeadRank lifts preference optimization into attention space via entropy-regularized head selection and distribution regularizers to sharpen discriminability for efficient listwise reranking.
-
In-context Learning and Induction Heads
Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning in transformers.
-
Longformer: The Long-Document Transformer
Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.
-
Why Retrieval-Augmented Generation Fails: A Graph Perspective
Attribution graphs reveal that RAG failures arise from shallow fragmented evidence flow in LLMs, enabling topology-based detection and targeted interventions that reinforce question-guided routing.
-
From Words to Amino Acids: Does the Curse of Depth Persist?
Protein language models exhibit consistent depth inefficiency where most task-relevant computation occurs in a subset of layers, mirroring patterns in large language models.
-
Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding
Speculative Verification adds a companion model that estimates draft-target alignment via information gain to dynamically set verification length, delivering up to 2x speedup over standard speculative decoding across tested models and batch sizes.
-
Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning
CD-MoE condenses fine-grained MoE layers with shared experts into dense layers, retaining 90% accuracy with 27.5% memory cut and 1.26x speedup on DeepSeekMoE-16B, recovering 98% via brief fine-tuning.
-
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.
-
The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity
Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training convergence.
-
Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation
Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.
-
Coverage-Driven KV Cache Eviction for Efficient and Improved Inference of LLM
K-VEC is a coverage-aware KV-cache eviction strategy using cross-head and cross-layer modules that improves performance by up to 10.35 points over prior methods on LongBench subsets at fixed memory budget.
-
TIDE: Every Layer Knows the Token Beneath the Context
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
-
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
-
Quantifying the cross-linguistic effects of syncretism on agreement attraction
LLM surprisal and attention entropy replicate syncretism modulation of agreement attraction in English and German, align with null results in Turkish, and partially match Russian patterns.
-
Generating Synthetic Malware Samples Using Generative AI
Opcode-sequence generative models produce synthetic malware data that raises minor-class classification accuracy by up to 60% and overall detection to 96%.
- MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation