pith. machine review for the scientific record.

arxiv: 2604.05091 · v1 · submitted 2026-04-06 · 💻 cs.CL · cs.DC · cs.OS

Recognition: no theorem link

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Zhengqing Yuan, Hanchi Sun, Lichao Sun, Yanfang Ye

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:53 UTC · model grok-4.3

classification 💻 cs.CL · cs.DC · cs.OS
keywords large language models · single GPU training · memory optimization · CPU memory streaming · full precision · parameter offloading · LLM training · model scaling

The pith

A memory-centric system trains 120 billion parameter language models at full precision on a single GPU by streaming parameters from CPU memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to train extremely large language models without loading the full model onto the GPU at once. All parameters and optimizer states live in the computer's main memory, and the GPU receives only the data needed for the current layer before sending gradients back. A pipelined engine overlaps the movement of data with the actual calculations, while dynamic layer structures avoid locking in fixed computation plans. This setup allows training to proceed continuously even when the GPU cannot hold the entire model. If the approach holds, it removes the need for multiple GPUs or reduced precision to handle models that exceed normal device memory limits.
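
To make the streaming pattern concrete, here is a minimal sketch (not the authors' code) of a layer-by-layer prefetch loop with two GPU-resident weight buffers and a dedicated copy stream. The helper names (`prefetch`, `apply_layer`), the buffer layout, and the data structures are assumptions for illustration; a real engine would also guard buffer reuse and offload gradients on a further stream.

```python
# Illustrative sketch of layer-wise parameter streaming with double buffering.
# Assumes PyTorch on CUDA with pinned host tensors; not the paper's implementation.
import torch

copy_stream = torch.cuda.Stream()  # host-to-device prefetch runs here

def prefetch(cpu_weights, gpu_buffer):
    """Enqueue async copies of one layer's weights into a GPU buffer; return a completion event."""
    with torch.cuda.stream(copy_stream):
        for name, w_cpu in cpu_weights.items():            # cpu_weights: dict of pinned host tensors
            gpu_buffer[name].copy_(w_cpu, non_blocking=True)
        done = torch.cuda.Event()
        done.record(copy_stream)
    return done

def streamed_forward(layers_cpu, gpu_buffers, x, apply_layer):
    """Run the forward pass while the next layer's weights are still in flight."""
    ready = {0: prefetch(layers_cpu[0], gpu_buffers[0])}
    for i in range(len(layers_cpu)):
        if i + 1 < len(layers_cpu):
            # Start moving layer i+1 into the other buffer while layer i computes.
            # (A full engine would also wait for the previous user of that buffer.)
            ready[(i + 1) % 2] = prefetch(layers_cpu[i + 1], gpu_buffers[(i + 1) % 2])
        torch.cuda.current_stream().wait_event(ready[i % 2])  # weights for layer i have landed
        x = apply_layer(x, gpu_buffers[i % 2])                 # stateless apply, weights bound now
    return x
```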

Core claim

By storing parameters and optimizer states in host memory and treating the GPU solely as a transient compute engine, the system streams parameters in for each layer and computes gradients out, minimizing persistent device state. A pipelined double-buffered execution engine overlaps parameter prefetching, computation, and gradient offloading across multiple streams to hide transfer costs, while stateless layer templates bind weights dynamically upon arrival to remove persistent autograd graph metadata. This combination enables reliable full-precision training of models up to 120 billion parameters on a single GPU equipped with 1.5 terabytes of host memory.

What carries the argument

Pipelined double-buffered execution engine that overlaps prefetching, computation, and offloading, paired with stateless layer templates that bind weights dynamically as they arrive.
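
Read literally, a "stateless layer template" can be sketched as a purely functional layer whose weights are call-time arguments rather than registered parameters, so nothing persistent lives on the device between layers. The following is an illustrative sketch under that reading; the function name, signature, and shapes are assumptions, not the paper's API.

```python
import torch
import torch.nn.functional as F

def ffn_template(x, w_in, b_in, w_out, b_out):
    """Stateless feed-forward block: weights are bound per call, never stored on a module."""
    h = F.gelu(F.linear(x, w_in, b_in))
    return F.linear(h, w_out, b_out)

# Usage sketch: the streamed GPU copies are marked as requiring grad, so autograd
# produces weight gradients that can be offloaded to host memory right after backward.
w_in = torch.randn(4096, 1024, device="cuda", requires_grad=True)   # illustrative shapes
b_in = torch.zeros(4096, device="cuda", requires_grad=True)
w_out = torch.randn(1024, 4096, device="cuda", requires_grad=True)
b_out = torch.zeros(1024, device="cuda", requires_grad=True)
y = ffn_template(torch.randn(8, 1024, device="cuda"), w_in, b_in, w_out, b_out)
```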

If this is right

  • Training runs become possible for models up to 120 billion parameters using only one GPU and large host memory.
  • Throughput for 14 billion parameter models reaches 1.84 times that of DeepSpeed ZeRO-3 with CPU offloading.
  • Models with 7 billion parameters can train with a 512k-token context on a single GH200.
  • Persistent storage of model data on the GPU is no longer required, freeing device memory for larger batches or longer sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Researchers without access to large GPU clusters could experiment with much bigger models using this streaming approach.
  • Advances in faster CPU-GPU links would likely push the maximum trainable size even higher under the same design.
  • Training frameworks might shift toward treating GPUs as temporary processors rather than permanent holders of model state.
  • The method could support deployment of large-model training in settings where only one high-memory GPU is available.

Load-bearing premise

The CPU-GPU connection must supply enough sustained bandwidth for the pipelined engine to keep computation running without pauses or stability problems when models reach 100 billion parameters.
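
A back-of-envelope version of this premise (every number below is an illustrative assumption, not a figure from the paper): overlap only hides the host link if moving one layer's full-precision weights takes no longer than the GPU's compute on that layer.

```python
# Illustrative feasibility arithmetic for hiding per-layer transfers behind compute.
# All constants are assumptions chosen for illustration, not measurements from the paper.
hidden = 12288                                   # assumed hidden size of a ~100B-class model
params_per_layer = 12 * hidden ** 2              # rough transformer-block parameter count
bytes_per_layer = params_per_layer * 4           # FP32, full precision
link_bw = 55e9                                   # assumed sustained host-link bandwidth, B/s
transfer_s = bytes_per_layer / link_bw           # roughly 0.13 s for a ~7 GB layer

tokens = 8192                                    # assumed tokens per micro-batch
flops_per_layer = 6 * tokens * params_per_layer  # forward + backward rule of thumb
gpu_flops = 300e12                               # assumed sustained GPU throughput, FLOP/s
compute_s = flops_per_layer / gpu_flops          # roughly 0.30 s

print(f"transfer {transfer_s*1e3:.0f} ms vs compute {compute_s*1e3:.0f} ms per layer")
# The pipeline stays busy only while compute_s >= transfer_s; gradient offload adds
# comparable traffic in the other direction, tightening the margin further.
```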

What would settle it

A complete training run of a 120 billion parameter model on one GPU with 1.5 terabytes of host memory that finishes without numerical instability or unexpected slowdowns would confirm the central claim.

Figures

Figures reproduced from arXiv: 2604.05091 by Hanchi Sun, Lichao Sun, Yanfang Ye, Zhengqing Yuan.

Figure 1: Sustained TFLOPS across model scales. MegaTrain …
Figure 2: MegaTrain architecture: CPU serves as the parame…
Figure 3: End-to-end pipelined execution. Weight prefetch (…
Figure 4: Host (CPU) memory footprint versus model scale.
Figure 5: Depth scalability with fixed model width (hidden and FFN sizes).
Figure 6: Width scalability with fixed model layers.
Figure 7: Performance comparison on a single A100 PCIe.
Original abstract

We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state. To battle the CPU-GPU bandwidth bottleneck, we adopt two key optimizations. 1) We introduce a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, enabling continuous GPU execution. 2) We replace persistent autograd graphs with stateless layer templates, binding weights dynamically as they stream in, eliminating persistent graph metadata while providing flexibility in scheduling. On a single H200 GPU with 1.5TB host memory, MegaTrain reliably trains models up to 120B parameters. It also achieves 1.84$\times$ the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models. MegaTrain also enables 7B model training with 512k token context on a single GH200.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MegaTrain, a memory-centric system for full-precision training of 100B+ parameter LLMs on a single GPU. Parameters and optimizer states reside in host memory and are streamed to the GPU per layer; a pipelined double-buffered engine overlaps prefetching, forward/backward passes, and gradient offloading across CUDA streams, while stateless layer templates replace persistent autograd graphs with dynamic weight binding. On an H200 with 1.5 TB host memory the system is claimed to train models up to 120 B parameters and to deliver 1.84× the throughput of DeepSpeed ZeRO-3 (with CPU offloading) for 14 B models; an additional result is 7 B training at 512 k context on GH200.

Significance. If the overlap and stability claims hold at 100 B+ scale, the work would meaningfully lower the hardware barrier for full-precision training of very large models, allowing single-GPU setups with large host memory to replace multi-GPU clusters for certain workloads. The concrete scale (120 B) and throughput (1.84×) numbers, together with the two concrete optimizations, constitute a practical systems contribution that could influence future memory-centric training frameworks.

major comments (3)
  1. [Results] Results section: the 1.84× throughput figure is reported only for 14 B models; the headline claim of reliable 120 B training provides no corresponding throughput numbers, per-layer timing breakdown, or measured overlap efficiency, leaving the central scaling assumption (full hiding of several-GB layer transfers behind compute) unverified.
  2. [System Design] System Design / Execution Engine: the pipelined double-buffered engine and stateless templates are described at a high level, yet no quantitative analysis or micro-benchmark is given for residual host-memory latency, stream-synchronization overhead, or autograd-binding cost when layer sizes reach several GB at 120 B scale; these costs directly determine whether the overlap assumption holds. (An illustrative sketch of such a micro-benchmark follows the minor comments below.)
  3. [Evaluation] Evaluation: the manuscript states concrete throughput and scale numbers without error bars, ablation studies, number of runs, or a detailed experimental protocol (model architecture, optimizer settings, exact host-memory bandwidth measurements), so the reliability claim for 120 B models rests on unverified assertions.
minor comments (2)
  1. [Abstract] The abstract and introduction use the phrase 'reliably trains' without defining a reliability metric or reporting any stability or convergence statistics.
  2. [System Design] Notation for the double-buffered engine (streams, buffers, binding) is introduced without a small diagram or pseudocode, making the overlap logic harder to follow.
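
As an illustration of the kind of micro-benchmark major comment 2 calls for, a minimal sketch (PyTorch; the tensor sizes and the matmul standing in for layer compute are arbitrary assumptions) times a pinned host-to-device copy and a compute kernel separately and then concurrently on separate streams:

```python
import torch

def time_ms(fn, iters=20):
    """Average wall-clock time of fn in milliseconds, measured with CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Illustrative sizes: a ~2 GB pinned "layer" and a matmul standing in for layer compute.
host = torch.empty(512 * 1024 * 1024, dtype=torch.float32, pin_memory=True)
dev = torch.empty_like(host, device="cuda")
a = torch.randn(8192, 8192, device="cuda")
copy_stream = torch.cuda.Stream()

def copy_only():
    dev.copy_(host, non_blocking=True)

def compute_only():
    torch.mm(a, a)

def overlapped():
    with torch.cuda.stream(copy_stream):
        dev.copy_(host, non_blocking=True)
    torch.mm(a, a)
    torch.cuda.current_stream().wait_stream(copy_stream)  # include the copy in the timing

t_copy, t_comp, t_both = time_ms(copy_only), time_ms(compute_only), time_ms(overlapped)
print(f"copy {t_copy:.1f} ms  compute {t_comp:.1f} ms  overlapped {t_both:.1f} ms")
# Good overlap means t_both approaches max(t_copy, t_comp) rather than their sum.
```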

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas where additional evidence and clarity would strengthen the paper. We address each major comment below and indicate the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Results] Results section: the 1.84× throughput figure is reported only for 14 B models; the headline claim of reliable 120 B training provides no corresponding throughput numbers, per-layer timing breakdown, or measured overlap efficiency, leaving the central scaling assumption (full hiding of several-GB layer transfers behind compute) unverified.

    Authors: We agree that throughput numbers are provided only for the 14B model to enable direct comparison against DeepSpeed ZeRO-3. The 120B experiments were designed primarily to establish feasibility of full-precision training at that scale on a single GPU without OOM errors, which was verified by successfully executing multiple training steps. In the revision we will add per-layer timing breakdowns and overlap-efficiency measurements for the 14B case, together with a scaling analysis that explains how the double-buffered pipelining continues to hide transfers as layer sizes grow. Full end-to-end throughput figures for 120B were not collected because of the prohibitive wall-clock time required for such runs. revision: partial

  2. Referee: [System Design] System Design / Execution Engine: the pipelined double-buffered engine and stateless templates are described at a high level, yet no quantitative analysis or micro-benchmark is given for residual host-memory latency, stream-synchronization overhead, or autograd-binding cost when layer sizes reach several GB at 120 B scale; these costs directly determine whether the overlap assumption holds.

    Authors: The system-design section currently emphasizes the architectural approach. We will add a new appendix containing micro-benchmark results that quantify host-memory transfer latency, CUDA-stream synchronization cost, and dynamic-binding overhead for layers of several GB. These measurements were obtained on the same H200 platform and demonstrate that the pipelined execution continues to overlap communication with computation at the layer sizes encountered in 120B models. revision: yes

  3. Referee: [Evaluation] Evaluation: the manuscript states concrete throughput and scale numbers without error bars, ablation studies, number of runs, or a detailed experimental protocol (model architecture, optimizer settings, exact host-memory bandwidth measurements), so the reliability claim for 120 B models rests on unverified assertions.

    Authors: We will expand the evaluation section with a complete experimental protocol that specifies model architectures (exact hidden sizes, number of layers, and attention heads for both 14B and 120B configurations), optimizer hyperparameters (AdamW betas, epsilon, weight decay, and learning-rate schedule), and host-memory bandwidth figures measured with standard CUDA bandwidthTest utilities. Throughput values will be reported with standard deviations and the number of repeated runs (five or more). Ablation studies isolating the double-buffered engine and the stateless-template optimization will also be included. revision: yes

standing simulated objections not resolved
  • Full end-to-end throughput and overlap-efficiency numbers for the 120B model are not available, as collecting them would require impractically long runs on the target hardware.

Circularity Check

0 steps flagged

No circularity: systems implementation grounded in hardware measurements

full rationale

The paper presents a systems engineering contribution for memory-offloaded LLM training. It introduces a pipelined double-buffered engine and stateless layer templates as implementation techniques, but contains no equations, derivations, fitted parameters, or self-referential definitions. All performance claims (e.g., 1.84× throughput on 14B models, reliable 120B training) are stated as direct outcomes of hardware benchmarks rather than reductions to prior results or internal fits. No load-bearing self-citations or uniqueness theorems appear; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The contribution is an engineering system that rests on domain assumptions about hardware behavior rather than new mathematical free parameters or postulated entities.

axioms (2)
  • domain assumption CUDA streams and host-device transfers can be overlapped to achieve near-continuous GPU utilization when data movement is the bottleneck.
    Invoked to justify the double-buffered pipelined engine.
  • domain assumption Stateless layer templates can be bound to streamed weights without altering gradient correctness or introducing numerical instability.
    Central to replacing persistent autograd graphs. (A minimal gradient-equivalence sketch follows this list.)
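
The second assumption can be checked in isolation. The sketch below (illustrative, not from the paper) confirms that binding weights functionally at call time yields the same gradients as a conventional parameter-holding layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
lin = nn.Linear(64, 64)          # conventional, parameter-holding layer
x = torch.randn(8, 64)

# Path A: persistent module with registered parameters.
lin(x).sum().backward()
grad_a = lin.weight.grad.clone()

# Path B: "stateless template" style: identical math, weights bound at call time,
# as they would be after being streamed in from host memory.
w = lin.weight.detach().clone().requires_grad_(True)
b = lin.bias.detach().clone().requires_grad_(True)
F.linear(x, w, b).sum().backward()

print(torch.allclose(grad_a, w.grad))  # True: dynamic binding preserves gradients
```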

pith-pipeline@v0.9.0 · 5521 in / 1458 out tokens · 83786 ms · 2026-05-10T18:53:16.218990+00:00 · methodology


Reference graph

Works this paper leans on

15 extracted references · 7 canonical work pages · 3 internal anchors

  1. [1] Jiarui Fang and Yang You. 2022. Meet Gemini: The Heterogeneous Memory Manager of Colossal-AI. Colossal-AI documentation. [Online]. Available: https://colossalai.org/docs/advanced_tutorials/meet_gemini
  2. [2] GPUsPerStudent.org. 2025. H100-Equivalent GPUs Per CS Student: Tracking Academic GPU Compute Availability in the United States. Project website. [Online]. Available: https://www.gpusperstudent.org/
  3. [3] John L. Hennessy and David A. Patterson. 2011. Computer Architecture: A Quantitative Approach (5th ed.). Morgan Kaufmann, San Francisco, CA.
  4. [4] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations. [Online]. Available: https://arxiv.org/abs/1412.6980
  5. [5] Hanyu Lai, Xiao Liu, Junjie Gao, Jiale Cheng, Zehan Qi, Yifan Xu, Shuntian Yao, Dan Zhang, Jinhua Du, Zhenyu Hou, et al. 2025. A survey of post-training scaling in large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Stroudsburg, ...
  6. [6] Changyue Liao, Mo Sun, Zihan Yang, Jun Xie, Kaiqi Chen, Binhang Yuan, Fei Wu, and Zeke Wang. 2025. Ratel: Optimizing Holistic Data Movement to Fine-Tune 100B Model on a Consumer GPU. In 2025 IEEE 41st International Conference on Data Engineering (ICDE). IEEE Press, Los Alamitos, CA, USA, 292–306. doi:10.1109/ICDE65448.2025.00029
  7. [7] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory F. Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed Precision Training. International Conference on Learning Representations. [Online]. Available: https://openreview.net/forum?id=r1gs9JgRZ
  8. [8] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press, Los Alamitos, CA, USA, 1–16.
  9. [9] Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery, New York, NY, USA, 1–14.
  10. [10] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. 2025. Kimi K2: Open Agentic Intelligence. arXiv preprint arXiv:2507.20534. [Online]. Available: https://arxiv.org/abs/2507.20534
  11. [11] Guiyao Tie, Zeli Zhao, Dingjie Song, Fuyang Wei, Rong Zhou, Yurou Dai, Wen Yin, Zhejian Yang, Jiangyue Yan, Yao Su, et al. 2025. A Survey on Post-Training of Large Language Models. arXiv preprint arXiv:2503.06072. [Online]. Available: https://arxiv.org/abs/2503.06072
  12. [12] Yanfang Ye, Zheyuan Zhang, Tianyi Ma, Zehong Wang, Yiyang Li, Shifu Hou, Weixiang Sun, Kaiwen Shi, Yijun Ma, Wei Song, et al. 2025. LLMs4All: A Review of Large Language Models Across Academic Disciplines. arXiv preprint arXiv:2509.19580. [Online]. Available: https://arxiv.org/abs/2509.19580
  13. [13] Manjiang Yu and Priyanka Singh. 2025. Differentially Private Fine-Tuning of Large Language Models: A Survey. In International Conference on Advanced Data Mining and Applications. Springer, Cham, Switzerland, 100–113.
  14. [14] Zhengqing Yuan, Weixiang Sun, Yixin Liu, Huichi Zhou, Rong Zhou, Yiyang Li, Zheyuan Zhang, Wei Song, Yue Huang, Haolong Jia, et al. 2025. EfficientLLM: Efficiency in Large Language Models. arXiv preprint arXiv:2505.13840. [Online]. Available: https://arxiv.org/abs/2505.13840
  15. [15] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. 2023. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. arXiv preprint arXiv:2304.11277. [Online]. Available: https://arxiv.org/abs/2304.11277