Primer: Searching for efficient transformers for language modeling

Quoc V Le · 2021 · arXiv 2109.08668

5 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

On the global convergence of gradient descent for wide shallow models with bounded nonlinearities

math.OC · 2026-05-11 · unverdicted · novelty 6.0

Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.

Three-Phase Transformer

cs.CL · 2026-04-15 · unverdicted · novelty 6.0

Three-Phase Transformer partitions hidden states into N cyclic channels with phase-respecting RMSNorm and Givens rotations plus an orthogonal Gabriel's horn DC injection, delivering 7.2% lower perplexity and 1.93x faster convergence than a matched RoPE baseline at 123M parameters.

ST-MoE: Designing Stable and Transferable Sparse Expert Models

cs.CL · 2022-02-17 · unverdicted · novelty 6.0

ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.

ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity

cs.LG · 2026-05-05 · unverdicted · novelty 5.0

ELAS pre-trains low-rank LLMs by applying 2:4 activation sparsity after squared ReLU to cut memory and accelerate training with minimal performance loss.

citing papers explorer


Flamingo: a Visual Language Model for Few-Shot Learning
cs.CV · 2022-04-29 · unverdicted · none · ref 105

On the global convergence of gradient descent for wide shallow models with bounded nonlinearities
math.OC · 2026-05-11 · unverdicted · none · ref 3

Three-Phase Transformer
cs.CL · 2026-04-15 · unverdicted · none · ref 2

ST-MoE: Designing Stable and Transferable Sparse Expert Models
cs.CL · 2022-02-17 · unverdicted · none · ref 200

ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity
cs.LG · 2026-05-05 · unverdicted · none · ref 11
