hub

Primer: Searching for efficient transformers for language modeling, 2022.URL https://arxiv

Primer: Searching for Eﬀicient Transformers for Language Modeling · 2022 · arXiv 2109.08668

11 Pith papers cite this work. Polarity classification is still indexing.

11 Pith papers citing it

read on arXiv browse 11 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1 extension 1

citation-polarity summary

extend 1 unclear 1

representative citing papers

Bug or Feature$^2$: Weight Drift, Activation Sparsity and Spikes

cs.LG · 2026-05-17 · accept · novelty 7.0 · 2 refs

The paper proves negative weight drift at initialization under MSE or cross-entropy with asymmetric activations, links it to up to 90% sparsity in GPT-nano, maps the sparsity-accuracy cliff across 79 configurations, and shows clipped ReLU² and GELU² improve validation loss.

From Competition to Collaboration: Designing Sustainable Mechanisms Between LLMs and Online Forums

cs.AI · 2026-02-04 · unverdicted · novelty 7.0

A new sequential interaction framework lets LLMs propose questions to forums, with simulations on real Stack Exchange data showing players can reach roughly half the utility of an ideal full-information scenario despite incentive misalignment.

Fast Inference from Transformers via Speculative Decoding

cs.LG · 2022-11-30 · accept · novelty 7.0

Speculative decoding accelerates exact sampling from large autoregressive models by 2-3x on T5-XXL by running smaller approximation models in parallel to propose token sequences that the large model then verifies in batches while preserving the original output distribution.

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

On the global convergence of gradient descent for wide shallow models with bounded nonlinearities

math.OC · 2026-05-11 · unverdicted · novelty 6.0

Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.

Three-Phase Transformer

cs.CL · 2026-04-15 · unverdicted · novelty 6.0

Three-Phase Transformer partitions hidden states into N cyclic channels with phase-respecting RMSNorm and Givens rotations plus an orthogonal Gabriel's horn DC injection, delivering 7.2% lower perplexity and 1.93x faster convergence than a matched RoPE baseline at 123M parameters.

ST-MoE: Designing Stable and Transferable Sparse Expert Models

cs.CL · 2022-02-17 · unverdicted · novelty 6.0

ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.

Mapping the Schedule x Bit-Width Boundary in Sub-100M Quantisation-Aware Training

cs.LG · 2026-05-25 · unverdicted · novelty 5.0

Factorial experiments with over 1300 runs falsify the hypothesis that INT6 QAT needs a different LR schedule from higher precision and identify a 50M-parameter boundary for INT4 schedule sensitivity.

ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity

cs.LG · 2026-05-05 · unverdicted · novelty 5.0

ELAS pre-trains low-rank LLMs by applying 2:4 activation sparsity after squared ReLU to cut memory and accelerate training with minimal performance loss.

NVIDIA Nemotron 3: Efficient and Open Intelligence

cs.CL · 2025-12-24 · unverdicted · novelty 5.0

NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

Resting Neurons, Active Insights: Robustifying Activation Sparsity in LLMs via Spontaneity

cs.LG · 2025-12-14 · unverdicted · novelty 5.0 · 2 refs

SPON adds a small set of trainable input-independent activation vectors as representational anchors, trained by distribution matching, to stabilize sparse activation in LLMs and recover performance lost to hidden-state distribution shifts.

citing papers explorer

Showing 1 of 1 citing paper after filters.

On the global convergence of gradient descent for wide shallow models with bounded nonlinearities math.OC · 2026-05-11 · unverdicted · none · ref 3
Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.

Primer: Searching for efficient transformers for language modeling, 2022.URL https://arxiv

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer