Primer: Searching for efficient transformers for language modeling, 2022. URL https://arxiv

11 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1, extension 1

citation-polarity summary

polarities: extend 1, unclear 1

representative citing papers

Bug or Feature²: Weight Drift, Activation Sparsity and Spikes

cs.LG · 2026-05-17 · accept · novelty 7.0 · 2 refs

The paper proves negative weight drift at initialization under MSE or cross-entropy with asymmetric activations, links it to up to 90% sparsity in GPT-nano, maps the sparsity-accuracy cliff across 79 configurations, and shows clipped ReLU² and GELU² improve validation loss.

Fast Inference from Transformers via Speculative Decoding

cs.LG · 2022-11-30 · accept · novelty 7.0

Speculative decoding accelerates exact sampling from large autoregressive models by 2-3x on T5-XXL by running smaller approximation models in parallel to propose token sequences that the large model then verifies in batches while preserving the original output distribution.

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

Three-Phase Transformer

cs.CL · 2026-04-15 · unverdicted · novelty 6.0

Three-Phase Transformer partitions hidden states into N cyclic channels with phase-respecting RMSNorm and Givens rotations plus an orthogonal Gabriel's horn DC injection, delivering 7.2% lower perplexity and 1.93x faster convergence than a matched RoPE baseline at 123M parameters.

ST-MoE: Designing Stable and Transferable Sparse Expert Models

cs.CL · 2022-02-17 · unverdicted · novelty 6.0

ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.

NVIDIA Nemotron 3: Efficient and Open Intelligence

cs.CL · 2025-12-24 · unverdicted · novelty 5.0

NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

citing papers explorer

Showing 11 of 11 citing papers.

  • Bug or Feature²: Weight Drift, Activation Sparsity and Spikes cs.LG · 2026-05-17 · accept · none · ref 20 · 2 links

    The paper proves negative weight drift at initialization under MSE or cross-entropy with asymmetric activations, links it to up to 90% sparsity in GPT-nano, maps the sparsity-accuracy cliff across 79 configurations, and shows clipped ReLU² and GELU² improve validation loss.

  • From Competition to Collaboration: Designing Sustainable Mechanisms Between LLMs and Online Forums cs.AI · 2026-02-04 · unverdicted · none · ref 42

    A new sequential interaction framework lets LLMs propose questions to forums, with simulations on real Stack Exchange data showing players can reach roughly half the utility of an ideal full-information scenario despite incentive misalignment.

  • Fast Inference from Transformers via Speculative Decoding cs.LG · 2022-11-30 · accept · none · ref 61

    Speculative decoding accelerates exact sampling from large autoregressive models by 2-3x on T5-XXL by running smaller approximation models in parallel to propose token sequences that the large model then verifies in batches while preserving the original output distribution.

  • Flamingo: a Visual Language Model for Few-Shot Learning cs.CV · 2022-04-29 · unverdicted · none · ref 105

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  • On the global convergence of gradient descent for wide shallow models with bounded nonlinearities math.OC · 2026-05-11 · unverdicted · none · ref 3

    Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.

  • Three-Phase Transformer cs.CL · 2026-04-15 · unverdicted · none · ref 2

    Three-Phase Transformer partitions hidden states into N cyclic channels with phase-respecting RMSNorm and Givens rotations plus an orthogonal Gabriel's horn DC injection, delivering 7.2% lower perplexity and 1.93x faster convergence than a matched RoPE baseline at 123M parameters.

  • ST-MoE: Designing Stable and Transferable Sparse Expert Models cs.CL · 2022-02-17 · unverdicted · none · ref 200

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.

  • Mapping the Schedule x Bit-Width Boundary in Sub-100M Quantisation-Aware Training cs.LG · 2026-05-25 · unverdicted · none · ref 23

    Factorial experiments with over 1300 runs falsify the hypothesis that INT6 QAT needs a different LR schedule from higher precision and identify a 50M-parameter boundary for INT4 schedule sensitivity.

  • ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity cs.LG · 2026-05-05 · unverdicted · none · ref 11

    ELAS pre-trains low-rank LLMs by applying 2:4 activation sparsity after squared ReLU to cut memory and accelerate training with minimal performance loss.

  • NVIDIA Nemotron 3: Efficient and Open Intelligence cs.CL · 2025-12-24 · unverdicted · none · ref 92

    NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

  • Resting Neurons, Active Insights: Robustifying Activation Sparsity in LLMs via Spontaneity cs.LG · 2025-12-14 · unverdicted · none · ref 71 · 2 links

    SPON adds a small set of trainable input-independent activation vectors as representational anchors, trained by distribution matching, to stabilize sparse activation in LLMs and recover performance lost to hidden-state distribution shifts.