Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Recognition: 3 Lean theorem links
Pith reviewed 2026-05-10 18:27 UTC · model grok-4.3
The pith
Intra-layer model parallelism in native PyTorch lets transformers scale to 8.3 billion parameters across 512 GPUs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that an intra-layer model-parallel scheme, implemented by partitioning the linear layers inside the attention and feed-forward blocks and adding the necessary all-reduce and scatter/gather calls, allows transformer models to grow to 8.3 billion parameters while still converging to better final performance than prior smaller models. They further establish that careful placement of layer normalization is required for the larger BERT-like architecture to realize these gains. The approach sustains 15.1 petaflops with 76 percent weak-scaling efficiency relative to a strong single-GPU baseline that sustains 39 teraflops (30 percent of peak).
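Those headline numbers are mutually consistent; the back-of-envelope check below is ours, not the paper's, and simply divides the per-GPU share of the sustained throughput by the 39-teraflop baseline:

```python
# Back-of-envelope check relating the headline numbers quoted above.
per_gpu_tflops = 15.1e3 / 512        # 15.1 PFLOPs spread over 512 GPUs ≈ 29.5 TFLOPs each
efficiency = per_gpu_tflops / 39.0   # against the 39 TFLOPs single-GPU baseline ≈ 0.76
print(f"{per_gpu_tflops:.1f} TFLOPs per GPU, weak-scaling efficiency {efficiency:.0%}")
```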
What carries the argument
Intra-layer model parallelism, achieved by partitioning weight matrices of each transformer sub-layer across GPUs and using a small number of collective communication primitives inside native PyTorch to reassemble activations and gradients.
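A minimal sketch of that splitting pattern for the feed-forward (MLP) sub-layer is shown below. It assumes torch.distributed is already initialized and that the feed-forward width divides evenly across ranks; it illustrates the idea (column-parallel first linear, row-parallel second linear, one all-reduce per block) rather than reproducing the authors' implementation.

```python
# Sketch of a tensor-parallel MLP block (illustration, not the authors' code).
# The first linear layer is split by columns and the second by rows, so each
# GPU holds a partial output that a single all-reduce completes.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class TensorParallelMLP(nn.Module):
    def __init__(self, hidden: int, ffn_hidden: int, world_size: int):
        super().__init__()
        assert ffn_hidden % world_size == 0
        shard = ffn_hidden // world_size
        # Column-parallel: each rank owns `shard` output columns of the first weight (and bias).
        self.fc1 = nn.Linear(hidden, shard)
        # Row-parallel: each rank owns the matching `shard` input rows of the second weight.
        self.fc2 = nn.Linear(shard, hidden, bias=False)
        # The second bias is added once, after the all-reduce, so it is not summed per rank.
        self.bias2 = nn.Parameter(torch.zeros(hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GELU runs independently on each rank because the column split keeps
        # every intermediate activation element whole on a single GPU.
        partial = self.fc2(F.gelu(self.fc1(x)))
        # One all-reduce per MLP block sums the partial results in the forward pass;
        # a training-grade version would wrap this in an autograd.Function so the
        # matching communication also runs in the backward pass.
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial + self.bias2
```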
If this is right
- Transformer models up to at least 8.3 billion parameters become trainable on clusters of a few hundred GPUs.
- State-of-the-art results are obtained on WikiText103 (perplexity 10.8), LAMBADA (accuracy 66.5 percent), and RACE (accuracy 90.9 percent).
- Sustained application throughput reaches 15.1 petaflops with 76 percent scaling efficiency relative to a single GPU baseline.
- The same intra-layer technique combines directly with pipeline parallelism for further scaling.
- Layer-normalization placement must be revisited when model depth and width both increase (the two candidate placements are sketched below).
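For concreteness, the two placements at stake are the standard post-norm and pre-norm residual blocks, sketched below in plain PyTorch. This illustrates the design space only; it is not a claim about the authors' exact final configuration.

```python
# Illustrative sketch of the two layer-norm placements discussed for BERT-like models.
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Normalize after the residual add (original BERT ordering)."""
    def __init__(self, hidden: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer          # e.g. an attention or MLP module
        self.norm = nn.LayerNorm(hidden)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """Normalize the sub-layer input, keeping an identity residual path."""
    def __init__(self, hidden: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(hidden)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```

The difference is that the post-norm form passes the residual stream through every normalization, while the pre-norm form leaves an identity path around each sub-layer, which is what changes training behavior as depth and width grow.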
Where Pith is reading between the lines
- The same splitting pattern could be applied to other attention-based architectures beyond the GPT and BERT families tested here.
- If communication overhead remains tolerable at still larger scales, the method lowers the barrier to experimenting with models whose size was previously limited by single-node memory.
- Combining this intra-layer scheme with data parallelism would allow training even bigger models on larger GPU counts without rewriting the core training loop.
- The observed need to adjust normalization placement hints that other architectural details may also require re-tuning once models exceed a few billion parameters.
Load-bearing premise
The extra communication steps inside each layer add only acceptable overhead and leave the optimizer's convergence behavior unchanged at billion-parameter scale.
What would settle it
A side-by-side run of the identical 8.3-billion-parameter model with and without the intra-layer splits, measuring both wall-clock time per step and final validation perplexity or accuracy, would show whether the added communication measurably slows training or lowers final quality.
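A minimal timing harness for the wall-clock half of that comparison might look like the sketch below; it assumes only that a single training step is wrapped in a callable, and it says nothing about the convergence half, which requires full training runs.

```python
# Sketch of per-step wall-clock measurement; the same loop would be run once
# with and once without the intra-layer splits (assumes a `step_fn` callable).
import time
import torch

def time_steps(step_fn, n_warmup: int = 5, n_measure: int = 20) -> float:
    for _ in range(n_warmup):          # discard warm-up iterations
        step_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()       # make sure queued GPU work has finished
    start = time.perf_counter()
    for _ in range(n_measure):
        step_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_measure   # seconds per step
```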
Original abstract
Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complimentary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging transformer based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs. To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT. We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows. Using the GPT-2 model we achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes an intra-layer model parallelism technique for training large transformer models in native PyTorch. It shows how to train models with up to 8.3 billion parameters using 512 GPUs, achieving 15.1 PetaFLOPs with 76% scaling efficiency relative to a single-GPU baseline. The work also reports state-of-the-art performance on the WikiText103, LAMBADA, and RACE datasets using GPT-2-like and BERT-like models, with specific improvements attributed to layer normalization placement in the latter.
Significance. If validated, these results would be significant for the field as they provide a straightforward way to scale transformer training beyond single-GPU memory limits without requiring new compilers or libraries. The emphasis on implementation simplicity and the reported benchmark gains underscore the practical value of model parallelism for advancing NLP capabilities through larger models. The provision of PyTorch-level details aids reproducibility.
Major comments (2)
- Experiments section: The reported SOTA results (WikiText103 perplexity of 10.8 vs. prior 15.8; LAMBADA accuracy of 66.5% vs. prior 63.2%) are presented as single point estimates without error bars, standard deviations, or details on the number of independent runs, which is load-bearing for confirming these constitute genuine advances rather than run-specific outcomes.
- Scaling efficiency results: The 76% weak scaling efficiency to 512 GPUs is central to the claim of acceptable overhead, yet the manuscript provides no ablation isolating the added communication primitives' impact on convergence speed, final accuracy, or numerical equivalence to a single-GPU model at the 8.3B scale.
Minor comments (2)
- Abstract: The mention of 'careful attention to the placement of layer normalization in BERT-like models' is important for the 3.9B model results but lacks a concrete description or pseudocode of the modification, reducing clarity for readers attempting to replicate the performance gain.
- Implementation details: Additional diagrams or pseudocode showing the precise insertion of the communication operations (e.g., all-reduce) within the attention and feed-forward layers would improve the reproducibility of the intra-layer parallelism approach.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation for minor revision. We address the two major comments point by point below.
Point-by-point responses
-
Referee: Experiments section: The reported SOTA results (WikiText103 perplexity of 10.8 vs. prior 15.8; LAMBADA accuracy of 66.5% vs. prior 63.2%) are presented as single point estimates without error bars, standard deviations, or details on the number of independent runs, which is load-bearing for confirming these constitute genuine advances rather than run-specific outcomes.
Authors: We acknowledge that multiple independent runs with error bars would provide stronger statistical support for the SOTA claims. However, the extreme computational cost of training 8.3B-parameter models (requiring hundreds of GPUs over extended periods) renders repeated full runs impractical, a constraint shared by prior large-scale works such as the original GPT-2 and BERT papers. The reported gains substantially exceed prior SOTA by wide margins, and our hyperparameter search and training procedures are detailed in the manuscript. In the revised version we will add a discussion in the Experiments section noting the single-run nature of the results and the resource limitations that preclude multiple runs. revision: partial
-
Referee: Scaling efficiency results: The 76% weak scaling efficiency to 512 GPUs is central to the claim of acceptable overhead, yet the manuscript provides no ablation isolating the added communication primitives' impact on convergence speed, final accuracy, or numerical equivalence to a single-GPU model at the 8.3B scale.
Authors: We agree that further isolation of communication overhead would be useful. At the 8.3B scale a single-GPU baseline is impossible due to memory constraints—the central motivation for our approach—so direct numerical equivalence cannot be measured. For smaller models that fit on one GPU we have confirmed that the model-parallel version is numerically equivalent (up to floating-point differences in all-reduce) and exhibits no measurable impact on convergence when communication is overlapped with computation. We will revise the manuscript to include this clarification together with scaling results on smaller models run both with and without model parallelism to better quantify the communication cost. revision: partial
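The equivalence claim in this response can be illustrated in a single process: summing the partial products of a split weight matrix reproduces the unsplit result up to floating-point error, which is exactly what the all-reduce computes across GPUs. The sketch below is our illustration of that check, not the authors' test harness.

```python
# Single-process sketch of the numerical-equivalence check described above.
import torch

torch.manual_seed(0)
hidden, ffn, world_size = 64, 256, 4
x = torch.randn(8, ffn)
full_weight = torch.randn(hidden, ffn)           # unsplit weight of a row-parallel linear
y_full = x @ full_weight.t()                     # reference output on one device

# Split the reduction dimension into per-rank shards, as tensor parallelism would.
weight_shards = full_weight.chunk(world_size, dim=1)
x_shards = x.chunk(world_size, dim=1)
partials = [xs @ ws.t() for xs, ws in zip(x_shards, weight_shards)]
y_parallel = torch.stack(partials).sum(dim=0)    # the sum an all-reduce would compute

# True up to floating-point reordering of the summation.
print(torch.allclose(y_full, y_parallel, rtol=1e-4, atol=1e-4))
```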
Circularity Check
No significant circularity detected
Full rationale
The paper describes an engineering implementation of intra-layer tensor model parallelism for transformers, achieved by inserting a small number of PyTorch communication primitives (all-reduce and split operations) without new compilers or libraries. Reported outcomes—convergence of 8.3B-parameter models on 512 GPUs, 76% weak scaling efficiency, 15.1 PetaFLOPs sustained, and SOTA perplexity/accuracy numbers on WikiText103, LAMBADA, and RACE—are presented as direct runtime measurements of the constructed system rather than predictions derived from fitted parameters or self-referential definitions. The layer-norm placement adjustment for BERT variants is an explicit design choice validated by ablation experiments, not an assumption that presupposes the final accuracy. No equations, uniqueness theorems, or ansatzes reduce the central claims to inputs by construction; the work is self-contained against external benchmarks and prior single-GPU baselines.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Larger transformer models improve downstream NLP performance when trained to convergence.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear · "We implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters... Our approach does not require a new compiler or library changes, is orthogonal and complimentary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch."
-
IndisputableMonolith.Foundation.RealityFromDistinction · reality_from_one_distinction · unclear · "We illustrate this approach by converging transformer based models up to 8.3 billion parameters using 512 GPUs... Using the GPT-2 model we achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets."
-
IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear · "We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs."
Forward citations
Cited by 60 Pith papers
-
ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...
-
Efficient Training on Multiple Consumer GPUs with RoundPipe
RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...
-
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...
-
NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding
NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.
-
HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
-
Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
-
From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs
Production logs from a 504-GPU LLM training cluster show 100% failure detection via multi-metric analysis, NFS saturation limiting bandwidth to 1.4-10.4% of link speed, and auto-retry achieving 33.3% success versus 12...
-
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
HyperEyes uses a dual-grained RL framework with parallel tool actions and efficiency rewards to achieve 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source multimodal agents.
-
CCL-Bench 1.0: A Trace-Based Benchmark for LLM Infrastructure
CCL-Bench packages traces and metadata to compute detailed compute, memory, and communication efficiency metrics, surfacing performance insights unavailable from end-to-end benchmarks.
-
Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems
CAIS delivers 1.38x end-to-end LLM training speedup over NVLS and 1.61x over T3 by making in-switch computing aware of computation memory requirements instead of treating communication as an isolated phase.
-
Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs
Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
-
Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment
VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...
-
Hierarchical Spatio-Channel Clustering for Efficient Model Compression in Medical Image Analysis
A spatio-channel clustering framework for CNN compression reduces FLOPs by 81% and raises brain tumor MRI classification accuracy from 87.76% to 89.80% compared with global SVD and Tucker baselines.
-
Shard the Gradient, Scale the Model: Serverless Federated Aggregation via Gradient Partitioning
GradsSharding shards gradients for serverless federated aggregation to support arbitrarily large models with identical results to traditional methods and cost savings above 500 MB gradient size.
-
FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training
FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.
-
Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion
TriMix dynamically fuses logits from three model sources to outperform baselines and Proxy Tuning on eight low-resource languages across four model families.
-
PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving
PipeLive enables live pipeline parallelism reconfiguration for LLMs via KV cache redesign and VM-migration-inspired patching, cutting TTFT by 2.5x and reconfiguration time to under 10ms.
-
Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.
-
ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads
ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneo...
-
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
-
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
-
Scaling and evaluating sparse autoencoders
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
-
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
Ring Attention with Blockwise Transformers for Near-Infinite Context
Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.
-
Efficient Memory Management for Large Language Model Serving with PagedAttention
PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.
-
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
-
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on s...
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
-
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
DeBERTa improves BERT-style models by separating content and relative position in attention and adding absolute positions to the decoder, yielding consistent gains on NLU and NLG tasks and the first single-model super...
-
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
-
ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference
ChunkFlow achieves up to 1.28x step-time speedup and up to 49% lower peak GPU memory for DiT inference by using a first-order model to guide communication-aware chunked prefetching.
-
ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload
ReCoVer uses fault-tolerant collectives, in-step recovery, and dynamic microbatch redistribution to maintain training trajectory equivalence under GPU failures, delivering 2.23x higher effective throughput than checkp...
-
ShardTensor: Domain Parallelism for Scientific Machine Learning
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
-
Enabling Performant and Flexible Model-Internal Observability for LLM Inference
DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
-
AdaPaD: Adaptive Parallel Deflation for PEFT with Self-Correcting Rank Discovery
AdaPaD performs parallel low-rank adaptation with self-correcting deflation targets and dynamic per-module rank growth, yielding competitive GLUE and SQuAD results at 30% smaller average adapter size.
-
Accelerating Compound LLM Training Workloads with Maestro
Maestro accelerates compound LLM training via section graphs for per-component configuration and wavefront scheduling for dynamic execution, reducing GPU consumption by ~40% in real deployments.
-
Verifiable Process Rewards for Agentic Reasoning
Verifiable Process Rewards (VPR) converts symbolic oracles into dense turn-level supervision for reinforcement learning in agentic reasoning, outperforming outcome-only rewards and transferring to general benchmarks.
-
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism
DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.
-
MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
-
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...
-
Priming: Hybrid State Space Models From Pre-trained Transformers
Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...
-
HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.
-
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.
-
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
-
ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
ROSE delivers 1.2-3.3x higher end-to-end throughput for agentic RL by safely co-using underutilized serving GPUs for rollouts while meeting serving SLOs.
-
Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale
BalanceRoute reduces data-parallel imbalance in LLM inference via F-score routing and lookahead, yielding higher end-to-end throughput on 144-NPU clusters versus vLLM baselines.
-
Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale
BalanceRoute uses a piecewise-linear F-score (with optional short lookahead) for sticky request routing in LLM serving, reducing DP imbalance and raising end-to-end throughput versus vLLM baselines on production and A...
-
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
-
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
-
VDCores: Resource Decoupled Programming and Execution for Asynchronous GPU
VDCores decouples GPU execution into virtual cores and micro-ops to boost asynchronous hardware use, delivering 24% average LLM inference throughput gains and 90% less programming effort.
-
SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference
SPECTRE delivers up to 2.28x speedup on large-model LLM inference by turning idle tail-model services into remote speculative drafters using hybrid parallel decoding and priority scheduling.
-
SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference
SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.
-
ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.
-
COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training
COPUS co-adapts batch size and parallelism during LLM training via goodput to deliver 3.9-8% average faster convergence than fixing one while tuning the other.
-
AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5X lower attention latency and 6.9X lower energy than NVIDIA H...
-
CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training
CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.
-
JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training
JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.
-
Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs
CuTile delivers high performance on select AI workloads and GPUs but varies significantly by architecture and is less portable than Triton across tested platforms.