Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Recognition: 3 Lean theorem links
Pith reviewed 2026-05-10 18:27 UTC · model grok-4.3
The pith
Intra-layer model parallelism in native PyTorch lets transformers scale to 8.3 billion parameters across 512 GPUs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that an intra-layer model-parallel scheme, implemented by partitioning the linear layers inside the attention and feed-forward blocks and adding the necessary all-reduce and scatter/gather calls, allows transformer models to grow to 8.3 billion parameters while still converging to better final performance than prior smaller models. They further establish that careful placement of layer normalization is required for the larger BERT-like architecture to realize these gains. The approach sustains 15.1 petaflops with 76 percent weak-scaling efficiency relative to a strong single-GPU baseline that sustains 39 teraflops (30 percent of peak).
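Those headline numbers are mutually consistent; the back-of-envelope check below is ours, not the paper's, and simply divides the per-GPU share of the sustained throughput by the 39-teraflop baseline:

```python
# Back-of-envelope check relating the headline numbers quoted above.
per_gpu_tflops = 15.1e3 / 512        # 15.1 PFLOPs spread over 512 GPUs ≈ 29.5 TFLOPs each
efficiency = per_gpu_tflops / 39.0   # against the 39 TFLOPs single-GPU baseline ≈ 0.76
print(f"{per_gpu_tflops:.1f} TFLOPs per GPU, weak-scaling efficiency {efficiency:.0%}")
```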
What carries the argument
Intra-layer model parallelism, achieved by partitioning weight matrices of each transformer sub-layer across GPUs and using a small number of collective communication primitives inside native PyTorch to reassemble activations and gradients.
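A minimal sketch of that splitting pattern for the feed-forward (MLP) sub-layer is shown below. It assumes torch.distributed is already initialized and that the feed-forward width divides evenly across ranks; it illustrates the idea (column-parallel first linear, row-parallel second linear, one all-reduce per block) rather than reproducing the authors' implementation.

```python
# Sketch of a tensor-parallel MLP block (illustration, not the authors' code).
# The first linear layer is split by columns and the second by rows, so each
# GPU holds a partial output that a single all-reduce completes.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class TensorParallelMLP(nn.Module):
    def __init__(self, hidden: int, ffn_hidden: int, world_size: int):
        super().__init__()
        assert ffn_hidden % world_size == 0
        shard = ffn_hidden // world_size
        # Column-parallel: each rank owns `shard` output columns of the first weight (and bias).
        self.fc1 = nn.Linear(hidden, shard)
        # Row-parallel: each rank owns the matching `shard` input rows of the second weight.
        self.fc2 = nn.Linear(shard, hidden, bias=False)
        # The second bias is added once, after the all-reduce, so it is not summed per rank.
        self.bias2 = nn.Parameter(torch.zeros(hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GELU runs independently on each rank because the column split keeps
        # every intermediate activation element whole on a single GPU.
        partial = self.fc2(F.gelu(self.fc1(x)))
        # One all-reduce per MLP block sums the partial results in the forward pass;
        # a training-grade version would wrap this in an autograd.Function so the
        # matching communication also runs in the backward pass.
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial + self.bias2
```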
If this is right
- Transformer models up to at least 8.3 billion parameters become trainable on clusters of a few hundred GPUs.
- State-of-the-art results are obtained on WikiText103 (perplexity 10.8), LAMBADA (accuracy 66.5 percent), and RACE (accuracy 90.9 percent).
- Sustained application throughput reaches 15.1 petaflops with 76 percent scaling efficiency relative to a single GPU baseline.
- The same intra-layer technique combines directly with pipeline parallelism for further scaling.
- Layer-normalization placement must be revisited when model depth and width both increase (the two candidate placements are sketched below).
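For concreteness, the two placements at stake are the standard post-norm and pre-norm residual blocks, sketched below in plain PyTorch. This illustrates the design space only; it is not a claim about the authors' exact final configuration.

```python
# Illustrative sketch of the two layer-norm placements discussed for BERT-like models.
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Normalize after the residual add (original BERT ordering)."""
    def __init__(self, hidden: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer          # e.g. an attention or MLP module
        self.norm = nn.LayerNorm(hidden)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """Normalize the sub-layer input, keeping an identity residual path."""
    def __init__(self, hidden: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(hidden)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```

The difference is that the post-norm form passes the residual stream through every normalization, while the pre-norm form leaves an identity path around each sub-layer, which is what changes training behavior as depth and width grow.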
Where Pith is reading between the lines
- The same splitting pattern could be applied to other attention-based architectures beyond the GPT and BERT families tested here.
- If communication overhead remains tolerable at still larger scales, the method lowers the barrier to experimenting with models whose size was previously limited by single-node memory.
- Combining this intra-layer scheme with data parallelism would allow training even bigger models on larger GPU counts without rewriting the core training loop.
- The observed need to adjust normalization placement hints that other architectural details may also require re-tuning once models exceed a few billion parameters.
Load-bearing premise
The extra communication steps inside each layer add only acceptable overhead and leave the optimizer's convergence behavior unchanged at billion-parameter scale.
What would settle it
A side-by-side run of the identical 8.3-billion-parameter model with and without the intra-layer splits, measuring both wall-clock time per step and final validation perplexity or accuracy, would show whether the added communication measurably slows training or lowers final quality.
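A minimal timing harness for the wall-clock half of that comparison might look like the sketch below; it assumes only that a single training step is wrapped in a callable, and it says nothing about the convergence half, which requires full training runs.

```python
# Sketch of per-step wall-clock measurement; the same loop would be run once
# with and once without the intra-layer splits (assumes a `step_fn` callable).
import time
import torch

def time_steps(step_fn, n_warmup: int = 5, n_measure: int = 20) -> float:
    for _ in range(n_warmup):          # discard warm-up iterations
        step_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()       # make sure queued GPU work has finished
    start = time.perf_counter()
    for _ in range(n_measure):
        step_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_measure   # seconds per step
```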
Original abstract
Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complimentary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging transformer based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs. To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT. We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows. Using the GPT-2 model we achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes an intra-layer model parallelism technique for training large transformer models in native PyTorch. It shows how to train models with up to 8.3 billion parameters using 512 GPUs, achieving 15.1 PetaFLOPs with 76% scaling efficiency relative to a single-GPU baseline. The work also reports state-of-the-art performance on the WikiText103, LAMBADA, and RACE datasets using GPT-2-like and BERT-like models, with specific improvements attributed to layer normalization placement in the latter.
Significance. If validated, these results would be significant for the field as they provide a straightforward way to scale transformer training beyond single-GPU memory limits without requiring new compilers or libraries. The emphasis on implementation simplicity and the reported benchmark gains underscore the practical value of model parallelism for advancing NLP capabilities through larger models. The provision of PyTorch-level details aids reproducibility.
Major comments (2)
- Experiments section: The reported SOTA results (WikiText103 perplexity of 10.8 vs. prior 15.8; LAMBADA accuracy of 66.5% vs. prior 63.2%) are presented as single point estimates without error bars, standard deviations, or details on the number of independent runs, which is load-bearing for confirming these constitute genuine advances rather than run-specific outcomes.
- Scaling efficiency results: The 76% weak scaling efficiency to 512 GPUs is central to the claim of acceptable overhead, yet the manuscript provides no ablation isolating the added communication primitives' impact on convergence speed, final accuracy, or numerical equivalence to a single-GPU model at the 8.3B scale.
Minor comments (2)
- Abstract: The mention of 'careful attention to the placement of layer normalization in BERT-like models' is important for the 3.9B model results but lacks a concrete description or pseudocode of the modification, reducing clarity for readers attempting to replicate the performance gain.
- Implementation details: Additional diagrams or pseudocode showing the precise insertion of the communication operations (e.g., all-reduce) within the attention and feed-forward layers would improve the reproducibility of the intra-layer parallelism approach.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation for minor revision. We address the two major comments point by point below.
Point-by-point responses
-
Referee: Experiments section: The reported SOTA results (WikiText103 perplexity of 10.8 vs. prior 15.8; LAMBADA accuracy of 66.5% vs. prior 63.2%) are presented as single point estimates without error bars, standard deviations, or details on the number of independent runs, which is load-bearing for confirming these constitute genuine advances rather than run-specific outcomes.
Authors: We acknowledge that multiple independent runs with error bars would provide stronger statistical support for the SOTA claims. However, the extreme computational cost of training 8.3B-parameter models (requiring hundreds of GPUs over extended periods) renders repeated full runs impractical, a constraint shared by prior large-scale works such as the original GPT-2 and BERT papers. The reported gains substantially exceed prior SOTA by wide margins, and our hyperparameter search and training procedures are detailed in the manuscript. In the revised version we will add a discussion in the Experiments section noting the single-run nature of the results and the resource limitations that preclude multiple runs. revision: partial
-
Referee: Scaling efficiency results: The 76% weak scaling efficiency to 512 GPUs is central to the claim of acceptable overhead, yet the manuscript provides no ablation isolating the added communication primitives' impact on convergence speed, final accuracy, or numerical equivalence to a single-GPU model at the 8.3B scale.
Authors: We agree that further isolation of communication overhead would be useful. At the 8.3B scale a single-GPU baseline is impossible due to memory constraints—the central motivation for our approach—so direct numerical equivalence cannot be measured. For smaller models that fit on one GPU we have confirmed that the model-parallel version is numerically equivalent (up to floating-point differences in all-reduce) and exhibits no measurable impact on convergence when communication is overlapped with computation. We will revise the manuscript to include this clarification together with scaling results on smaller models run both with and without model parallelism to better quantify the communication cost. revision: partial
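The equivalence claim in this response can be illustrated in a single process: summing the partial products of a split weight matrix reproduces the unsplit result up to floating-point error, which is exactly what the all-reduce computes across GPUs. The sketch below is our illustration of that check, not the authors' test harness.

```python
# Single-process sketch of the numerical-equivalence check described above.
import torch

torch.manual_seed(0)
hidden, ffn, world_size = 64, 256, 4
x = torch.randn(8, ffn)
full_weight = torch.randn(hidden, ffn)           # unsplit weight of a row-parallel linear
y_full = x @ full_weight.t()                     # reference output on one device

# Split the reduction dimension into per-rank shards, as tensor parallelism would.
weight_shards = full_weight.chunk(world_size, dim=1)
x_shards = x.chunk(world_size, dim=1)
partials = [xs @ ws.t() for xs, ws in zip(x_shards, weight_shards)]
y_parallel = torch.stack(partials).sum(dim=0)    # the sum an all-reduce would compute

# True up to floating-point reordering of the summation.
print(torch.allclose(y_full, y_parallel, rtol=1e-4, atol=1e-4))
```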
Circularity Check
No significant circularity detected
Full rationale
The paper describes an engineering implementation of intra-layer tensor model parallelism for transformers, achieved by inserting a small number of PyTorch communication primitives (all-reduce and split operations) without new compilers or libraries. Reported outcomes—convergence of 8.3B-parameter models on 512 GPUs, 76% weak scaling efficiency, 15.1 PetaFLOPs sustained, and SOTA perplexity/accuracy numbers on WikiText103, LAMBADA, and RACE—are presented as direct runtime measurements of the constructed system rather than predictions derived from fitted parameters or self-referential definitions. The layer-norm placement adjustment for BERT variants is an explicit design choice validated by ablation experiments, not an assumption that presupposes the final accuracy. No equations, uniqueness theorems, or ansatzes reduce the central claims to inputs by construction; the work is self-contained against external benchmarks and prior single-GPU baselines.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Larger transformer models improve downstream NLP performance when trained to convergence.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear · "We implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters... Our approach does not require a new compiler or library changes, is orthogonal and complimentary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch."
-
IndisputableMonolith.Foundation.RealityFromDistinction · reality_from_one_distinction · unclear · "We illustrate this approach by converging transformer based models up to 8.3 billion parameters using 512 GPUs... Using the GPT-2 model we achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets."
-
IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear · "We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs."
Forward citations
Cited by 60 Pith papers
-
ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...
-
Efficient Training on Multiple Consumer GPUs with RoundPipe
RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...
-
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...
-
NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding
NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.
-
HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
-
Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
-
From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs
Production logs from a 504-GPU LLM training cluster show 100% failure detection via multi-metric analysis, NFS saturation limiting bandwidth to 1.4-10.4% of link speed, and auto-retry achieving 33.3% success versus 12...
-
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
HyperEyes uses a dual-grained RL framework with parallel tool actions and efficiency rewards to achieve 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source multimodal agents.
-
CCL-Bench 1.0: A Trace-Based Benchmark for LLM Infrastructure
CCL-Bench packages traces and metadata to compute detailed compute, memory, and communication efficiency metrics, surfacing performance insights unavailable from end-to-end benchmarks.
-
Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems
CAIS delivers 1.38x end-to-end LLM training speedup over NVLS and 1.61x over T3 by making in-switch computing aware of computation memory requirements instead of treating communication as an isolated phase.
-
Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs
Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
-
Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment
VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...
-
Hierarchical Spatio-Channel Clustering for Efficient Model Compression in Medical Image Analysis
A spatio-channel clustering framework for CNN compression reduces FLOPs by 81% and raises brain tumor MRI classification accuracy from 87.76% to 89.80% compared with global SVD and Tucker baselines.
-
Shard the Gradient, Scale the Model: Serverless Federated Aggregation via Gradient Partitioning
GradsSharding shards gradients for serverless federated aggregation to support arbitrarily large models with identical results to traditional methods and cost savings above 500 MB gradient size.
-
FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training
FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.
-
Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion
TriMix dynamically fuses logits from three model sources to outperform baselines and Proxy Tuning on eight low-resource languages across four model families.
-
PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving
PipeLive enables live pipeline parallelism reconfiguration for LLMs via KV cache redesign and VM-migration-inspired patching, cutting TTFT by 2.5x and reconfiguration time to under 10ms.
-
Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.
-
ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads
ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneo...
-
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
-
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
-
Scaling and evaluating sparse autoencoders
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
-
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
Ring Attention with Blockwise Transformers for Near-Infinite Context
Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.
-
Efficient Memory Management for Large Language Model Serving with PagedAttention
PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.
-
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
-
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on s...
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
-
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
DeBERTa improves BERT-style models by separating content and relative position in attention and adding absolute positions to the decoder, yielding consistent gains on NLU and NLG tasks and the first single-model super...
-
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
-
ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference
ChunkFlow achieves up to 1.28x step-time speedup and up to 49% lower peak GPU memory for DiT inference by using a first-order model to guide communication-aware chunked prefetching.
-
ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload
ReCoVer uses fault-tolerant collectives, in-step recovery, and dynamic microbatch redistribution to maintain training trajectory equivalence under GPU failures, delivering 2.23x higher effective throughput than checkp...
-
ShardTensor: Domain Parallelism for Scientific Machine Learning
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
-
Enabling Performant and Flexible Model-Internal Observability for LLM Inference
DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
-
AdaPaD: Adaptive Parallel Deflation for PEFT with Self-Correcting Rank Discovery
AdaPaD performs parallel low-rank adaptation with self-correcting deflation targets and dynamic per-module rank growth, yielding competitive GLUE and SQuAD results at 30% smaller average adapter size.
-
Accelerating Compound LLM Training Workloads with Maestro
Maestro accelerates compound LLM training via section graphs for per-component configuration and wavefront scheduling for dynamic execution, reducing GPU consumption by ~40% in real deployments.
-
Verifiable Process Rewards for Agentic Reasoning
Verifiable Process Rewards (VPR) converts symbolic oracles into dense turn-level supervision for reinforcement learning in agentic reasoning, outperforming outcome-only rewards and transferring to general benchmarks.
-
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism
DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.
-
MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
-
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...
-
Priming: Hybrid State Space Models From Pre-trained Transformers
Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...
-
HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.
-
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.
-
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
-
ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
ROSE delivers 1.2-3.3x higher end-to-end throughput for agentic RL by safely co-using underutilized serving GPUs for rollouts while meeting serving SLOs.
-
Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale
BalanceRoute reduces data-parallel imbalance in LLM inference via F-score routing and lookahead, yielding higher end-to-end throughput on 144-NPU clusters versus vLLM baselines.
-
Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale
BalanceRoute uses a piecewise-linear F-score (with optional short lookahead) for sticky request routing in LLM serving, reducing DP imbalance and raising end-to-end throughput versus vLLM baselines on production and A...
-
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
-
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
-
VDCores: Resource Decoupled Programming and Execution for Asynchronous GPU
VDCores decouples GPU execution into virtual cores and micro-ops to boost asynchronous hardware use, delivering 24% average LLM inference throughput gains and 90% less programming effort.
-
SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference
SPECTRE delivers up to 2.28x speedup on large-model LLM inference by turning idle tail-model services into remote speculative drafters using hybrid parallel decoding and priority scheduling.
-
SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference
SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.
-
ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.
-
COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training
COPUS co-adapts batch size and parallelism during LLM training via goodput to deliver 3.9-8% average faster convergence than fixing one while tuning the other.
-
AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5X lower attention latency and 6.9X lower energy than NVIDIA H...
-
CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training
CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.
-
JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training
JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.
-
Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs
CuTile delivers high performance on select AI workloads and GPUs but varies significantly by architecture and is less portable than Triton across tested platforms.