The paper proves negative weight drift at initialization under MSE or cross-entropy with asymmetric activations, links it to up to 90% sparsity in GPT-nano, maps the sparsity-accuracy cliff across 79 configurations, and shows clipped ReLU² and GELU² improve validation loss.
hub
Primer: Searching for efficient transformers for language modeling, 2022.URL https://arxiv
11 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
A new sequential interaction framework lets LLMs propose questions to forums, with simulations on real Stack Exchange data showing players can reach roughly half the utility of an ideal full-information scenario despite incentive misalignment.
Speculative decoding accelerates exact sampling from large autoregressive models by 2-3x on T5-XXL by running smaller approximation models in parallel to propose token sequences that the large model then verifies in batches while preserving the original output distribution.
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.
Three-Phase Transformer partitions hidden states into N cyclic channels with phase-respecting RMSNorm and Givens rotations plus an orthogonal Gabriel's horn DC injection, delivering 7.2% lower perplexity and 1.93x faster convergence than a matched RoPE baseline at 123M parameters.
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.
Factorial experiments with over 1300 runs falsify the hypothesis that INT6 QAT needs a different LR schedule from higher precision and identify a 50M-parameter boundary for INT4 schedule sensitivity.
ELAS pre-trains low-rank LLMs by applying 2:4 activation sparsity after squared ReLU to cut memory and accelerate training with minimal performance loss.
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
SPON adds a small set of trainable input-independent activation vectors as representational anchors, trained by distribution matching, to stabilize sparse activation in LLMs and recover performance lost to hidden-state distribution shifts.
citing papers explorer
-
Bug or Feature$^2$: Weight Drift, Activation Sparsity and Spikes
The paper proves negative weight drift at initialization under MSE or cross-entropy with asymmetric activations, links it to up to 90% sparsity in GPT-nano, maps the sparsity-accuracy cliff across 79 configurations, and shows clipped ReLU² and GELU² improve validation loss.
-
From Competition to Collaboration: Designing Sustainable Mechanisms Between LLMs and Online Forums
A new sequential interaction framework lets LLMs propose questions to forums, with simulations on real Stack Exchange data showing players can reach roughly half the utility of an ideal full-information scenario despite incentive misalignment.
-
Fast Inference from Transformers via Speculative Decoding
Speculative decoding accelerates exact sampling from large autoregressive models by 2-3x on T5-XXL by running smaller approximation models in parallel to propose token sequences that the large model then verifies in batches while preserving the original output distribution.
-
Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
-
On the global convergence of gradient descent for wide shallow models with bounded nonlinearities
Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.
-
Three-Phase Transformer
Three-Phase Transformer partitions hidden states into N cyclic channels with phase-respecting RMSNorm and Givens rotations plus an orthogonal Gabriel's horn DC injection, delivering 7.2% lower perplexity and 1.93x faster convergence than a matched RoPE baseline at 123M parameters.
-
ST-MoE: Designing Stable and Transferable Sparse Expert Models
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.
-
Mapping the Schedule x Bit-Width Boundary in Sub-100M Quantisation-Aware Training
Factorial experiments with over 1300 runs falsify the hypothesis that INT6 QAT needs a different LR schedule from higher precision and identify a 50M-parameter boundary for INT4 schedule sensitivity.
-
ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity
ELAS pre-trains low-rank LLMs by applying 2:4 activation sparsity after squared ReLU to cut memory and accelerate training with minimal performance loss.
-
NVIDIA Nemotron 3: Efficient and Open Intelligence
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
-
Resting Neurons, Active Insights: Robustifying Activation Sparsity in LLMs via Spontaneity
SPON adds a small set of trainable input-independent activation vectors as representational anchors, trained by distribution matching, to stabilize sparse activation in LLMs and recover performance lost to hidden-state distribution shifts.